ABSTRACT


Thompson,  Daniel  B.  M.S.,  Department  of  Computer  Science, 
Wright  State  University,  1987.  A  Multiprocessor  Avionics 
System  For  An  Unmanned  Research  Vehicle. 

The  Air  Force  Flight  Dynamics  Laboratory  is  developing  a  new 
Unmanned  Research  Vehicle  (URV)  to  support  low  cost/risk  in- 
house  flight  tests  of  advanced  flight  control  concepts. 
Implicit  to  the  development  of  the  testbed  is  the  addition 
of  an  advanced  on-board  avionics  system  to  support 
computationally  intensive  embedded  tasks.  As  the  first  phase 
of  the  development  of  this  avionics  system,  a  prototype 
multiprocessor  system  and  operating  system  software  have 
been designed and tested. As a demonstration of its
capabilities,  the  prototype  implements  a  control  mixer 
algorithm  for  control  surface  reconfiguration  in  the  event 
of  failure.  The  analysis,  design,  development,  and  test 
procedures  and  results  of  the  research  of  the  prototype  are 
described.  Areas  of  further  research  in  the  following  phases 
of development are also discussed.

PREFACE


Parallel  processing  research  is  still  pretty  much  in 
its  infancy.  Much  "hype"  surrounds  many  of  the  efforts  to 
date.  Promises  of  high  performance  gains  with  unbelievable 
processor  utilization  efficiencies  are  not  uncommon.  Still, 
many  potential  applications  exist,  and  many  varying 
solutions  to  the  problems  are  possible. 

Serious  research  into  the  technological  area  still 
needs  to  be  performed,  particularly  in  the  programming 
aspect  of  the  problem.  The  best  way  to  understand  the 
technology,  and  the  problems  that  exist  in  the  application 
of  the  technology,  is  to  get  the  "hands-on"  experience  in 
the  development  and  use  of  parallel  systems.  I  am  grateful 
for  the  chance  given  to  me  by  AFWAL/FIGL  to  gain  this 
experience, both in this thesis effort and in my normal job
duties.

Many  varying  disciplines  are  involved  in  a  development 
effort  such  as  this.  I  thank  all  those  persons  who  provided 
advice or help in those areas where I needed it. Not all can
be  mentioned  here,  but  certainly  all  are  appreciated. 
Without  them,  this  project  could  not  have  been  a  success.  My 
sincerest  gratitude  to: 

First  and  foremost,  Dr.  Kuldip  Rattan,  who  acted  not  only  as 
my  thesis  advisor,  but  also  served  as  consultant  for  the 
control  mixer  applications  functions. 

Tom Roesle, who helped to construct the VME rack and power
supply system, and procure the necessary parts.

Doug  Roy,  who  provided  the  necessary  URV  and  8061  autopilot 
information  and  helped  set  up  the  simulation  runs. 

Dr.  R.  D.  Dixon  and  Dr.  Alastair  McAulay,  who  served  on  my 
thesis  committee  and  provided  valuable  advice. 

And last, but not least, the current generation
"Microteers": Don Pogoda, Mike Rottman, Tom Dermis, Jeff
Mangen, and Vince Crum, who provided draft reviews, hardware
design  advice,  and  idea  critiques;  and  put  up  with  me  when 


(1)  Introduction 

The Unmanned Research Vehicle (URV) in-house program at
the Air Force Flight Dynamics Laboratory (AFWAL/FI),
Wright-Patterson Air Force Base, is developing a new research
testbed  to  provide  a  wide  range  of  capabilities  in  support 
of  advanced,  low  cost  flight  testing  of  flight  control 
concepts.  Implicit  to  the  development  of  the  testbed 
vehicle,  hereafter  referred  to  as  TN21,  is  the  addition  of 
an  advanced  on-board  avionics  system  to  support 
computationally  intensive  embedded  tests.  As  it  is  with  the 
rest  of  the  vehicle,  the  avionics  system  is  to  be  developed 
for  high  capability  at  low  cost. 

Single  processor  architectures,  such  as  the  one  used  in 
the  current  URV  system,  can  be  sufficient  for  basic 
autopilot  control,  but  lack  the  necessary  throughput 
potential  for  advanced  embedded  tests  like  those  envisioned 
for  future  URV  applications.  The  capabilities  of 
microprocessors and microcontrollers are rapidly improving;
however, the demands on digital systems are outpacing this
improvement. For example, adaptive control
algorithms  and  artificial  intelligence  (AI)  are  likely  to 
drive  throughput  requirements  an  order  of  magnitude  or  more 
beyond  previous  generation  requirements.  In  contrast,  next 
generation  processors  can  only  be  expected  to  be  two  to  four 
times  the  performance  of  their  predecessors  [1,2,3].  As  a 
result,  multiple  processor  architectures  will  be  required  to 
meet the goals of advanced system needs and tests.

Multiprocessor systems are being utilized in or
researched  for  many  varied  applications,  including  flight 
control  [4,5,6].  The  technology,  though  not  yet  mature, 
carries  many  possibilities,  the  most  obvious  being  high 
speed  computation.  Many  multiprocessor  architectures  have 
been  proposed  [7,8,9];  however,  no  one  architecture  has 
emerged  as  being  superior  to  the  others  over  a  wide  range  of 
applications.  At  present,  the  application  dictates  the 
architecture  used. 

The  purpose  of  this  thesis  project  is  to  exploit 
advances  in  microprocessor  and  memory  technology,  bussing 
systems,  and  real-time  multitasking  operating  systems 
techniques  to  develop  a  multiprocessor  architecture 
appropriate  for  application  to  a  URV  system.  This  research 
will  allow  low  cost  flight  testing  of  concepts  which 
heretofore  required  high  cost/high  risk  flight  tests  on 
expensive  manned  systems.  To  meet  the  goal,  a  first  phase 
effort  has  been  performed  to  develop  and  demonstrate  a 
multiprocessor  system  suitable  for  use  on  the  proposed  URV. 
This  system  is  a  prototype,  not  the  actual  flight-worthy 
system.  Not  all  aspects  of  the  final  URV  avionics  system 
have  been  designed  and  implemented.  This  work  was  focused  on 
the  main  "computing  engine"  of  the  multiprocessor  system  and 
its  associated  operating  system  software.  This  focused, 
multiphase  design  strategy  was  taken  to  provide  short  term, 
low  cost  results  with  limited  manpower.  As  such,  development 
time,  cost,  and  use  of  available  laboratory  resources  were 
important  drivers. 

The  prototype  multiprocessor  system  has  been  developed 
and demonstrated incorporating near-state-of-the-art
microprocessors, coprocessors, and interconnect
technologies. A real time multiprocessor/multitasking
operating system (RTMOS) has been designed and implemented
to  coordinate  the  parallel  tasks  and  data  exchanges.  To 
demonstrate  the  capabilities  of  the  prototype,  a  set  of 
applications  tasks,  implementing  a  control  mixer  algorithm 
for  failed  control  surface  reconfiguration,  was  developed. 

To  begin  the  description  of  the  development  of  the 
research  and  prototype,  Chapter  2  gives  a  background 
discussion  of  the  related  URV  and  Continuously  Reconfiguring 
Multi-Microprocessor  Flight  Control  System  (CRMMFCS) 
programs.  Chapter  3  covers  the  requirements  driving  the 
design  of  the  multiprocessor  system.  Chapters  4  and  5 
discuss  the  hardware  and  software  design  decisions 
respectively.  Chapter  6  addresses  the  approach  taken  to 
develop  and  test  the  laboratory  prototype.  Chapter  7 
concludes  the  technical  discussion  with  prototype 
performance  and  demonstration  results. 


(2)  Background 

For  several  years,  the  Control  Systems  Development 
Branch  (AFWAL/FIGL)  of  the  Air  Force  Flight  Dynamics 
Laboratory's  Flight  Control  Division  has  been  performing 
research utilizing Unmanned Research Vehicles (URV). Using
an  automotive  emissions  and  fuel  economy  processor,  the 
Intel/Ford  8061,  the  URV  in-house  program  has  developed  a 
low  cost  digital  autopilot  which  has  provided  significant 
size,  weight,  and  power  savings  over  its  predecessor  system. 
The  extensive  input/output  (I/O)  capabilities  of  the  8061 
make  it  ideal  as  a  "single  chip"  controller  in  such 
applications [10,11]. Included in these capabilities are
thirteen analog-to-digital (A/D) conversion inputs, eight
high  speed  inputs,  and  ten  high  speed  outputs  which  can  be 
used  for  pulse  width  modulated  signals. 

The  URV  has  progressed  to  its  current  state  as  a  low 
cost/risk  in-house  flight  testbed  for  research  projects  at 
the  Flight  Dynamics  Laboratory.  However,  due  to  the  limited 
computational  capabilities  of  the  single  16  bit  processor  of 
the  8061,  many  new  proposed  tests  must  be  run  on  a  more 
powerful  ground-based  computer  system,  with  control  commands 
uplinked  to  the  URV  via  its  telemetry  system.  This  process 
suffers  from  several  limitations,  the  most  critical  being 
the  relatively  slow  uplink  transmission  speed. 

In  its  role  as  a  low  cost/risk  flight  testbed,  the  URV 
has  been  very  successful.  The  capabilities  of  the  URV 
system, however, must expand to meet its potential
applications if it is to continue to serve as a useful
in-house research tool. The current airframe, originally
designed  as  a  drone  with  specific  mission  requirements,  is 
limited  in  performance  and  adaptability.  The  airframe  is 
relatively  heavy,  causing  a  high  stall  speed.  The  bulkhead 
is  such  that  payload  and  electronics  space  is  extremely 
limited.  The  use  of  the  8061  allows  digital  autopilot 
capabilities  to  be  placed  in  the  small  space  available,  but 
no  expandability  exists  to  allow  for  growth  into  more 
advanced  embedded  tests. 

To  correct  these  shortcomings,  a  new  URV  testbed  system 
(TN21)  has  been  initiated  (Figure  2.1).  The  design  will 
incorporate  a  modular  airframe  structure  that  will  allow 
different  configurations,  such  as  varying  wing  sizes,  to  be 
implemented  around  the  baseline  design.  The  aircraft  will  be 
made  of  lighter  materials  and  will  make  better  use  of  space 
to  create  less  wing  loading  and  greater  maneuverability.  The 
design  will  also  create  a  significantly  larger  payload  bay 
which  can  support  more  electronics  and  cargo.  New  control 
surfaces  and  an  overall  improved  aerodynamic  design  will 
allow  for  a  wider  range  of  applications  to  be  flight  tested 
on the URV. In short, TN21 will be designed from the outset
to  be  a  flexible  tool  for  low  cost,  high  payoff  flight 
tests. 

In  order  to  make  the  best  use  of  the  performance  and 
flexibility  capabilities  of  TN21,  an  avionics  system  with 
greater  capabilities  than  the  current  autopilot  is  required. 
The  increased  space  for  embedded  electronics  has  opened  the 
possibility  of  an  advanced,  yet  low  cost,  control  system 
utilizing multiple microprocessors and expanded memory. The
system should not only be able to handle the real time
response needs of advanced control laws and the changing
characteristics of a flexible and reconfigurable airframe,
but  must  also  be  applicable  to  a  wide  variety  of  test 
problems.  Already  a  wide  spectrum  of  potential  applications 
exists.  Among  these  are  control  law  reconfiguration  around 
failed  surfaces,  fault  tolerant  hardware  and  software 
techniques,  artificial  intelligence,  and  the  application  of 
High Order Languages (HOL) such as Ada to real time control.

AFWAL/FIGL  has  accumulated  the  background  knowledge  and 
experience  necessary  for  the  development  of  such  a  system. 
Concurrent  with  the  URV  development  work,  AFWAL/FIGL  has 
conducted  research  in  the  areas  of  microprocessor-based 
multiprocessor  systems  and  parallel  processing  as  applied  to 
flight  control  and  vehicle  management  systems.  The 
Continuously  Reconfiguring  Multi-Microprocessor  Flight 
Control  System  (CRMMFCS)  in-house  project  (1980-83)  produced 
several  unique  concepts  in  fault  tolerance  and  parallel 
processing  and  a  successful  laboratory  demonstration  system 
[4].  This  program  has  since  spawned  further  research  into 
microprocessor  applications  in  flight  control  and  related 
areas,  including  the  current  Advanced  Multiprocessor  Control 
Architecture  Definition  (AMCAD)  in-house  project  [5].  These 
programs  laid  the  groundwork  for  this  thesis  research  by 
providing  valuable  hands-on  experience  in  the  design, 
development,  troubleshooting,  and  programming  of 
multiprocessor laboratory systems using microprocessor
emulation systems, logic analyzers, and support software.
Lessons learned during these other development efforts have
been applied to allow effective integration of the
technology area into a low cost research testbed vehicle.


(3)  Design  Considerations  of  a  Multiprocessor  URV  Avionics 
System 

Because  of  the  wide  range  of  potential  applications  of 
such  a  URV  system,  a  quantifiable  performance  measure  of  the 
avionics  system  is  difficult  to  pinpoint  as  a  baseline. 
Almost  none  of  the  tests  identified  in  the  previous  chapter 
have  been  formalized  into  planned  tests  to  this  point.  From 
this,  flexibility  appears  to  be  the  critical  design  goal.  A 
multiple  processor  configuration  is  desired,  but  the  number 
required  is  a  function  of  individual  processor  capability, 
communications throughput, and applications computation
requirements.  Without  the  last,  the  best  design  is  the  one 
that  allows  for  minimal  multiprocessor  configurations  with 
growth  potential  to  meet  needs. 

Size,  weight,  and  power  often  drive  the  limits  of 
growth  potential.  For  the  TN21  system,  however,  these 
aspects  have  minimal  impact.  The  available  electronics  space 
has  been  approximated  at  9  inches  by  34  inches  by  11  inches. 
Current  microprocessor,  memory,  and  interface  logic 
densities  allow  considerable  computing  power  to  be  included 
in  the  available  space.  The  on-board  power  system  will  be 
able  to  supply  more  than  50  amps  at  24  volts,  and  again, 
current  device  technology  allows  operation  well  below  this 
constraint.  A  design  constraint  for  weight  is  currently 
unavailable.  However,  all  indications  are  that  the  weight  of 
electronics  and  packaging  filling  the  available  space  will 
be  within  the  limits  of  the  aerodynamic  design. 
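
The size and power figures quoted above can be turned into a rough budget check. The following is a minimal sketch only: the bay dimensions and supply rating come from the text, while the per-board power draw passed to `power_margin` is a hypothetical value chosen for illustration, not a figure from the design.

```python
# Rough size and power budget using the figures quoted in the text.
BAY_INCHES = (9, 34, 11)   # approximate electronics space
SUPPLY_AMPS = 50           # lower bound on supply current from the text
SUPPLY_VOLTS = 24


def bay_volume_cubic_inches(dims=BAY_INCHES):
    """Volume of the electronics bay in cubic inches."""
    width, length, height = dims
    return width * length * height


def available_power_watts(amps=SUPPLY_AMPS, volts=SUPPLY_VOLTS):
    """Electrical power the on-board supply can deliver."""
    return amps * volts


def power_margin(num_boards, watts_per_board):
    """Fraction of supply power left unused.

    watts_per_board is a hypothetical per-board draw, not a measured value.
    """
    used = num_boards * watts_per_board
    return 1.0 - used / available_power_watts()


if __name__ == "__main__":
    print(bay_volume_cubic_inches())   # 9 * 34 * 11 cubic inches
    print(available_power_watts())     # 50 A * 24 V in watts
    print(power_margin(8, 30))         # eight boards at an assumed 30 W each
```

Under these assumptions the supply rating works out to 1200 W, which is why the text treats power as a minor constraint for a system of this scale.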


The  input  and  output  (I/O)  requirements  perceived  for 
TN21 do not increase significantly over those of the existing
URV. Although the figures will vary due to changing
configurations,  the  baseline  requirement  is  8  analog  sensors 
and  10  pulse  width  modulation  driven  servos.  The  8061 
processor  used  in  the  previous  URV  system  contains 
internally  the  capabilities  to  handle  the  I/O  requirements 
of  the  new  system.  As  will  be  addressed  further  in  the  next 
chapter,  the  problem  with  using  the  8061  as  the  processor 
type  for  the  multiprocessor  is  its  lack  of  general  purpose 
applicability.  Without  floating  point  support,  operating 
system  aiding  instructions,  and  a  conventional  memory 
interface,  the  8061  lacks  the  capabilities  to  make  it  a 
flexible  computation  base  in  a  multiprocessor  environment. 

Reliability  and  safety  are  concerns  in  a  design  such  as 
this.  In  manned  systems,  fault  tolerance  is  required  to 
ensure  safety  of  flight  and  to  avoid  the  loss  of  expensive 
systems.  The  URV  is  a  unique  flight  test  bed  in  that  the 
airframe  and  equipment  are  relatively  inexpensive.  The 
aircraft  are  flown  in  controlled  areas  where  risk  is 
minimized.  A  full  suite  of  fault  tolerance  mechanisms, 
including  the  capability  to  maintain  testing  after  computing 
resources  fail,  is  an  overly  optimistic,  if  not  self 
defeating,  design  goal.  The  inclusion  of  a  complete  set  of 
fault  tolerance  mechanisms  would  only  serve  to  drain 
available  resources  and  drive  the  design  and  utilization 
costs  to  an  unacceptable  level.  Still,  the  use  of  more 
complex,  multiprocessor  configurations  creates  a  greater 
chance  of  individual  component  failure.  As  with  the  previous 
URV system, the answer lies in a fail safe mechanism which
allows the pilot of the vehicle to regain direct control of
the vehicle in an unassisted mode of operation or degrade
the aircraft control to a slow, circling return to the
ground. As such, the design criteria allow for a means to
bypass the multiprocessor system upon detection, by the
system or pilot, of a failure in the control system.

In summary, the multiprocessor system should be
flexible to meet changing airframe and test configurations,
utilize processors of general purpose applicability, and
implement only a minimal set of fault tolerance
capabilities. Most importantly, however, the system must
provide efficient, high speed computing at a low overall
system development and maintenance cost. By meeting these
requirements, the multiprocessor avionics system will
support the goals of the new URV research testbed.


(4) Overall Goals to Architecture Specification

The  three  basic  areas  comprising  the  avionics  system 
can  be  categorized  as  computation,  interprocessor 
communications,  and  I/O.  This  statement  is  not  to  imply  that 
the  three  areas  need  be  distinct;  in  fact,  in  some 
implementations  quite  a  bit  of  overlap  exists.  In  this 
design,  however,  the  areas  are  best  handled  separately.  The 
following  sections  describe  the  specification  of  the  three 
areas  and  the  resulting  TN21  multiprocessor  configuration. 

(4.1)  Computation  Area 

The  use  of  the  URV  as  a  general  test-bed  for  low 
cost/risk  flight  testing  dictates  the  use  of  a  homogeneous 
set  of  processors  of  a  type  that  can  handle  a  wide  variety 
of  jobs.  The  baseline  requirement  includes  capabilities  to 
handle  real  time  operating  system  functions,  16  or  32  bit 
integer  processing,  floating  point  operations,  logical  bit 
manipulations,  and  a  variety  of  memory  addressing  modes. 
These  are  not  difficult  criteria  to  meet;  in  fact  several 
current, readily accessible microprocessors exist which
contain  the  above  capabilities.  Examples  include  the 
Motorola  68000  family,  the  Intel  8086/286/386  family,  the 
National  32x32  family,  and  the  Zilog  Z8000/Z80000  family.  To 
complicate  matters,  none  of  the  acceptable  microprocessor 
families has a clear technological advantage over the
others  as  applied  to  the  wide  range  of  applications.  The 
8061, however, does not meet the stated goals. The chip was
designed for microcontroller applications, not for general
purpose computing.

As  a  result,  the  processor  of  choice  was  picked  due  to 
the  development  support  available  in  the  Microprocessor 
Laboratory  at  AFWAL/FIGL.  This  facility  utilizes  MC68000  in- 
circuit  emulators  and  logic  analysis  capabilities  in  a 
variety  of  programs  and  has  assemblers  for  the  development 
of  MC68000  code.  The  use  of  these  capabilities  eases  the 
development cycle and results in more reliable designs
which  take  less  time  to  develop.  The  choice  of  the  MC68000, 
therefore,  provides  the  required  capabilities  while  allowing 
for  low  cost  development. 

(4.2)  Interprocessor  Communications  Area 

Many  schemes  have  been  developed  or  proposed  for  the 
interconnection  of  multiprocessors  [7,8,9].  These  range  from 
a  simple,  single  bus  to  a  complex,  multiple  interconnection 
network.  Figure  4.1  shows  some  of  the  possibilities. 
Intuitively,  the  development  of  a  system  for  a  URV  would  not 
include  a  complex,  difficult  to  develop  and  test, 
interconnection  scheme.  In  fact,  the  wide  availability  of 
commercial  busses  and  parts  leads  to  the  choice  of  a  bus. 
Two questions must be addressed before finalizing this
choice.

The  reliability  of  a  single  thread  bus  can  be  seen  as  a 
potential  problem.  However,  as  addressed  in  the  previous 
chapter,  multiple  redundant  channels  of  communication, 
providing  a  fault  tolerant  capability,  are  not  required  in 
this URV system. Although a commercial bus with multiple
processors attached is more likely to fail than the single
processor URV system of before, a fail-safe mechanism
around the multiprocessor "network" can provide the
reliability required.

A  more  troubling  question  involves  a  common  concern  in 
single  bus  multiprocessor  systems:  transfer  bottlenecks.  It 
is  commonly  accepted  that  a  practical  limit  exists  on  the 
number  of  processors  attached  to  a  single  bus.  Beyond  this 
limit, the addition of more processors degrades, rather than
improves, system performance by forcing more resource
sharing.  Even  before  this  limit  is  reached,  degradation  of 
added  processor  power  can  be  significant.  Figure  4.2  shows 
a  realistic  graph  of  processor  numbers  versus  effective 
processing  power.  Note  that  the  ideal  limit  is  a  linear 
increase  in  processing  power  with  increased  numbers  of 
processors. This situation is not reached in practice for a
variety of reasons, the overhead of achieving parallel
interactions being a major cause. The result is a reduction
in  the  added  percentage  of  processor  power  as  more 
processors  are  included  in  the  system.  However,  if  the 
number  of  processors  attached  to  a  single  bus  is  kept  low 
enough  and  the  speed  of  the  bus  is  sufficient,  a  single  bus 
can  be  used  to  support  multiple  processors.  Ideal 
parallelism  may  not  be  achievable,  but  effective  gains  in 
computing  power  can  be  realized. 
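
The shape of the curve described above can be illustrated with a toy model. This is a sketch under stated assumptions, not the model behind Figure 4.2: it assumes that each additional processor on the bus costs every processor a fixed fraction of its time (the hypothetical `overhead` parameter), so that per-processor efficiency falls linearly as processors are added.

```python
def effective_power(n, overhead=0.05):
    """Effective processing power, in units of one unloaded processor.

    Assumes (hypothetically) that every processor loses a fixed fraction
    `overhead` of its time for each other processor sharing the bus.
    """
    if n < 1:
        raise ValueError("need at least one processor")
    efficiency = max(0.0, 1.0 - overhead * (n - 1))
    return n * efficiency


def best_processor_count(overhead=0.05, limit=64):
    """Processor count that maximizes effective power under this model."""
    return max(range(1, limit + 1), key=lambda n: effective_power(n, overhead))


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(n, round(effective_power(n), 2))
```

With this assumed overhead, each added processor contributes less than the one before, and past a certain count the total falls. The actual limit for a given system depends on bus speed and on how often tasks must exchange data, which is the point made in the text.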

Given  the  sizing  limits  from  the  previous  chapter,  six 
to  eight  processors  appear  to  be  the  practical  upper  end  for 
the  number  of  processors  to  be  used.  This  limit  would  appear 
to be sufficient for all foreseeable URV embedded
applications. If the granularity of computation is kept
coarse and transfers are kept to a minimum, six
to eight processors can be used practically in the URV
system with a single bus interconnect.


As  with  the  choice  of  processors,  there  are  multiple, 
readily  available  options  for  parallel  busses  (Figure  4.3). 
None  of  these  options  is  unworkable  with  the  MC68000. 
However, the VME bus is the most compatible, having been developed
by Motorola with a protocol almost identical to that of the MC68000.
Also,  far  more  available  components  based  on  the  MC68000 
exist  for  the  VME  bus  than  there  are  for  any  other  parallel 
bus.  The  VME  bus  meets  the  requirement  for  high  speed 
transfer  with  a  near  state-of-the-art  40  Mbytes  per  second 
peak  transfer  rate  [12].  The  bus  specification  also  includes 
features  of  access  fairness  and  interprocessor  interrupts. 
The  asynchronous  transfer  protocol  allows  the  addition  of 
components  which  operate  at  different  speeds  than  the  bus 
itself.  This  feature  can  be  useful  in  an  environment  where 
flexibility  is  desirable.  Also,  industry  acceptance  of  the 
bus  has  led  to  wide  availability  of  parts,  boards,  and 
assemblies.  Of  the  other  bus  options  available,  only 
Multibus  II  appears  to  have  the  wide  industry  acceptance, 
capability,  and  availability  of  components  of  the  VME  bus. 
Comparisons  of  the  two  busses  in  industry  publications 
[12,13,14]  have  produced  no  clear  advantages  for  either.  As 
such,  the  choice  of  processor  led  to  the  choice  of  the  VME 
bus  for  the  URV  multiprocessor  prototype. 

(4.3)  I/O  Area 

The introduction to this chapter described the I/O
section as a separate, distinct unit. I/O capabilities could
be  merged  into  the  processing  modules  of  the  multiprocessor, 
but  this  is  not  practical  from  a  number  of  standpoints. 
First,  connection  of  all  I/O  devices  (sensors,  actuators)  to 
all  processors  (Figure  4.4)  becomes  cumbersome,  and 
expandability  is  quickly  inhibited.  Dedicating  certain  I/O 
devices  to  certain  processors  limits  the  flexibility  of 
processor utilization. This configuration could also hinder
expandability  and  the  programmability  of  the  system.  Also, 
the  inclusion  of  I/O  capability,  such  as  analog-to-digital 
converters, would severely complicate the processing module
hardware, creating more hardware for no corresponding gain
in  processing  capability.  Usage  of  the  8061  could  solve  this 
problem;  however,  as  stated  previously,  the  8061  is 
basically  unsuitable  as  the  computation  area  processor-type 
for  the  multiprocessor  system. 

A  more  acceptable  solution  is  to  separate  the  I/O  area 
from  the  main  computational  area.  Although  the  8061  is  not 
suitable  for  a  multiprocessor  environment,  interface  to  the 
VME  bus  is  still  possible.  The  solution,  therefore,  is  to 
connect  all  I/O  to  an  8061  module  (or  multiple  modules  if 
needed)  and  program  the  8061  to  supply  the  multiprocessor 
system  with  digital  inputs  through  the  VME  bus.  Outputs  from 
the  system  can  then  be  sent  back  along  the  bus  to  the  8061 
for  pulse  width  modulated  output  to  the  servo  actuators. 
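
The data flow just described can be sketched as one control frame. This is an illustrative model only: the class and function names are hypothetical, the shared memory is modeled as plain Python lists rather than VME-addressable RAM, and the control law is a placeholder. The sensor and servo counts are the baseline figures from Chapter 3.

```python
NUM_SENSORS = 8    # baseline analog sensor count from Chapter 3
NUM_SERVOS = 10    # baseline PWM servo count from Chapter 3


class SharedMemory:
    """Stand-in for the memory area visible to all modules over the bus."""

    def __init__(self):
        self.sensor_words = [0] * NUM_SENSORS
        self.servo_commands = [0] * NUM_SERVOS


class IOModule:
    """Plays the role of the 8061: owns all sensor inputs and servo outputs."""

    def __init__(self, shared):
        self.shared = shared
        self.pwm_outputs = [0] * NUM_SERVOS

    def sample(self, analog_readings):
        # Publish digitized sensor values for the compute processors.
        if len(analog_readings) != NUM_SENSORS:
            raise ValueError("unexpected number of sensor readings")
        self.shared.sensor_words[:] = analog_readings

    def output(self):
        # Drive the servos from whatever the compute side last wrote.
        self.pwm_outputs[:] = self.shared.servo_commands


def control_law(shared):
    """Placeholder computation: copy the first sensor to every servo."""
    shared.servo_commands[:] = [shared.sensor_words[0]] * NUM_SERVOS


def frame(io, shared, readings):
    """One sample/compute/output cycle."""
    io.sample(readings)
    control_law(shared)
    io.output()
    return list(io.pwm_outputs)
```

The compute side never touches a sensor or servo directly; it only reads and writes the shared area, which is what keeps the processing modules free of I/O hardware.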

One  byproduct  of  this  choice  is  that  it  provides  a 
possible  solution  to  the  fail  safe  operation  considerations 
discussed  above.  If  provided  with  a  "bare-bones"  autopilot 
function in reserve, the 8061 can be commanded by the pilot
to take control of the aircraft in the event of a failure in
the  multiprocessor  section.  In  this  scenario,  all  tests 
would  be  terminated  and  the  vehicle  would  be  returned  to  the 
ground  safely.  This  mode  provides  a  reliability  no  worse 
than  that  of  the  current  autopilot. 
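
A minimal sketch of this fail-safe selection follows. The names are hypothetical, and one behavior is an assumption rather than something stated in the text: the sketch latches into the backup mode once entered, on the reasoning that all tests are terminated after a failure.

```python
from enum import Enum


class Mode(Enum):
    MULTIPROCESSOR = "multiprocessor"
    BACKUP_AUTOPILOT = "backup_autopilot"


def select_mode(current, system_fault_detected, pilot_override):
    """Choose the controlling source for the next frame.

    Assumption: once the backup autopilot takes over, control does not
    return to the multiprocessor automatically.
    """
    if current is Mode.BACKUP_AUTOPILOT:
        return current
    if system_fault_detected or pilot_override:
        return Mode.BACKUP_AUTOPILOT
    return Mode.MULTIPROCESSOR


def servo_commands(mode, multiprocessor_cmds, backup_cmds):
    """Route either the multiprocessor or the backup commands to the servos."""
    if mode is Mode.BACKUP_AUTOPILOT:
        return backup_cmds
    return multiprocessor_cmds
```

Because the 8061 already sits between the bus and the servos, the switch amounts to ignoring the commands arriving over the bus and using the reserve autopilot output instead.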

(4.4)  Resulting  Configuration  and  Hardware  Selection 

The  resulting  hardware  configuration  is  shown  in  Figure 
4.5.  As  noted  in  Chapter  1,  the  main  design  drivers  for  the 
prototype  were  development  time  and  cost.  These  drivers 
dictate  the  use  of  available  parts  where  possible.  The 
decision  made  was  to  purchase  as  much  of  the  hardware 
preassembled  as  possible  to  limit  debugging  time.  MC68000 
processor  boards  with  VME  interface  were  purchased  [15]. 
These  boards  contain  512  Kbytes  of  zero  wait  state  dual 
ported dynamic RAM accessible from the VME interface as
well  as  the  MC68000  resident  on  the  board.  The  boards  also 
include  128  Kbytes  of  EPROM  and  an  interface  connector 
through  which  a  wire-wrap  board  can  be  attached  for  hardware 
expansion.  A  VME  backplane,  rack  assembly,  and  power  supply 
were  also  purchased. 

The  dual  ported  RAM  sections  on  each  board  provide  the 
means  for  interprocessor  communication.  Since  each  section 
can  be  set  up  to  be  addressed  in  a  different  memory  range  on 
the  VME  bus,  the  concatenation  of  the  sections  creates  a 
distributed version of a shared memory. Figure 4.6 shows the
logical  view  of  the  VME  addressable  memory  area. 
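
The address arithmetic behind this arrangement can be sketched as follows. The 512 Kbyte section size comes from the board description above; the base address and the assumption that the sections are mapped contiguously in board order are hypothetical choices for illustration.

```python
BOARD_RAM_BYTES = 512 * 1024   # dual-ported RAM per board, from the text
VME_BASE = 0x100000            # hypothetical start of the shared area


def board_window(board_index, base=VME_BASE):
    """First and last bus address of one board's RAM section."""
    start = base + board_index * BOARD_RAM_BYTES
    return start, start + BOARD_RAM_BYTES - 1


def locate(vme_address, num_boards, base=VME_BASE):
    """Translate a bus address into (board index, offset within that board)."""
    offset = vme_address - base
    if not 0 <= offset < num_boards * BOARD_RAM_BYTES:
        raise ValueError("address outside the shared memory area")
    return divmod(offset, BOARD_RAM_BYTES)


if __name__ == "__main__":
    for board in range(4):
        start, end = board_window(board)
        print(f"board {board}: {start:#08x} .. {end:#08x}")
```

From the programmer's point of view the sections form one flat address range; which physical board answers a given access is decided purely by the address.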

The selection of boards utilizing this communications
technique was not arbitrary. Experience with multiprocessor
system software at AFWAL/FIGL has indicated that shared
memory techniques present a simpler view to the programmer
and provide the programmer with more flexibility than message
passing (or point to point) techniques. One example of the
flexibility  possible  is  the  capability  for  an  any-task-on- 
any-processor  programming  model.  This  capability  allows  the 
programmer  to  design  the  parallel  software  without  specific 
details  of  the  architecture,  such  as  the  number  of  processor 
modules.  Chapter  5  will  address  this  shared  memory 
capability. If required, however, a message passing logical
view can be set up over a physical shared memory design
through the use of mailboxes or other similar data
structures.
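
One way such a mailbox could be laid out is sketched below. This is an assumed design for illustration, not the structure used in the prototype: a single-slot mailbox occupying two words of shared memory, with one sender and one receiver. The shared memory is modeled as a Python list.

```python
class Mailbox:
    """One-slot mailbox laid out in a block of shared memory words.

    Word 0 of the block is a full/empty flag; word 1 holds the message.
    Intended for exactly one sender and one receiver.
    """

    FLAG, DATA = 0, 1

    def __init__(self, shared_words, base, initialize=True):
        self.mem = shared_words
        self.base = base
        if initialize:
            self.mem[base + self.FLAG] = 0   # start empty

    def send(self, word):
        """Post a message; return False if the slot is still occupied."""
        if self.mem[self.base + self.FLAG]:
            return False
        self.mem[self.base + self.DATA] = word
        self.mem[self.base + self.FLAG] = 1  # set the flag only after the data
        return True

    def receive(self):
        """Take the message if one is waiting, otherwise return None."""
        if not self.mem[self.base + self.FLAG]:
            return None
        word = self.mem[self.base + self.DATA]
        self.mem[self.base + self.FLAG] = 0  # free the slot for the sender
        return word
```

A sender and receiver on different processors would each construct a `Mailbox` over the same base address, with only one of them initializing the flag.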


(5)  Software  Specification 

The  software  for  the  TN21  avionics  system  intersects 
with  two  often  distinct  areas  of  software  development:  real 
time  control  and  multiple  processor  operating  systems.  The 
commercial  market  contains  skeletal  real  time  operating 
system  kernels  which  are  usually  designed  to  operate  on  a 
single processor. Multiprocessor operating systems are
often developed with more concern for keeping
processing elements busy with some portion of the task
loading and coordinating exchanges of messages than for
the rapid response to asynchronous events or the strict
periodic sample/compute/output cycle of real time control.
The  software  development  of  the  URV  multiprocessor  system 
has  been  focused  towards  applying  real  time  operating  system 
techniques  in  the  environment  of  a  multiprocessor  system. 

(5.1)  Techniques  of  Multiprocessor  Software 

Multiprocessor  software  can  be  viewed  in  many  ways.  In 
the  simplest  case,  each  processor  can  be  given  a  distinct 
job  to  perform,  with  little  or  no  interaction  with  the  other 
processors  of  the  system.  Each  processor's  job  is  a 
separately  specified  and  developed  piece  of  software  code. 
An  example  of  this  is  a  multiprocessor  system  supporting 
batch  processing  of  user  programs.  Each  program  submitted 
can  be  allocated  to  a  separate  processor.  This  method  is 
dependent upon separate, independent threads of computation,
and often results in inefficient use of processing
resources. Some processors are idle, waiting for work, while
others  are  busy  performing  jobs  which  conceivably  could  be 
subdivided. 

From  this  simple  case,  many  extensions  are  possible. 
One  example  is  the  static  assignment  of  multiple, 
sequentially  ordered  jobs  per  processor.  Instead  of  having 
each  processor  perform  a  single,  complete  job,  each  now  may 
participate  in  multiple  threads  of  computation  by  executing 
subsets  of  these  threads.  This  brings  the  interactions 
closer  to  true  parallel  processing  and  allows  for  more  even 
processing  load  distribution.  The  interactions  of  the 
processors  are  still  preplanned  according  to  the  job 
scheduling.  Execution  of  program  sections  is  totally 
determined  before  run-time,  forcing  the  programmer  to  be  the 
scheduler. The programmer must also account for code
execution time so that exchange windows can be met. Although
processor utilization improves, coding is much more difficult.

Single  processor  multitasking  operating  systems  can 
provide  the  basis  for  another  option.  These  have  been  used 
extensively  for  real  time  control  and  can  provide  for  the 
efficient  and  timely  scheduling  of  jobs.  The  question  is  how 
to  extend  this  type  of  model  to  multiple  processor  systems. 

Two methods are in use: a single queue of tasks, or a
single task resource manager from which the jobs can be
obtained. The latter case can be considered a form of
master/slave processing, where one processor of the system
is dedicated as a distributor of tasks to the other
processors. The former case requires a shared memory with
mechanisms for ensuring uninterrupted access by processors
to the queue during task acquisition. Neither of these cases
really follows directly, in implementation, from the single
processor multitasking model.

If  each  processor  is  given  its  own  queue  of  tasks  to 
perform,  the  single  processor  model  can  be  used.  The 
question  to  address  is  how  tasks  are  distributed  to  the 
individual  queues.  The  assignment  may  be  static  and 
determined  during  programming,  or  dynamic  and  changeable 
during  execution.  Also,  given  that  the  tasks  may  not  be 
prescheduled  for  exchanges,  how  are  interactions 
coordinated? 

(5.2)  Programmer  Viewpoint:  Tasks 

A  task  is  a  unit  of  software  which  may  operate  in 
parallel  with  or  sequentially  in  coordination  with  other 
tasks.  The  code  may  be  thought  of  in  the  same  general  terms 
as  a  module  or  subroutine  in  conventional  programming.  A 
multitasking  operating  system  provides  for  the  scheduling  of 
multiple  tasks  for  the  effective  operation  of  the  system. 

Tasks  are  not  often  independent.  They  interact  with 
other  tasks  to  acquire  data,  and  to  enforce  cause  and  effect 
relationships.  The  division  of  a  problem  into  a  set  of 
tasks, therefore, is similar to the division of a
conventional  program  into  a  hierarchy  of  subroutines. 
Software  is  organized  into  modules  of  logical  relevance,  and 
their  interfaces  to,  or  interactions  with,  other  modules  are 
specified. The difference with tasks is the additional
factor of time. Certain tasks can operate at the same time,
while others operate in some ordering defined by task
interactions. A mechanism to enforce data exchanges
according to the interaction specification is required.

The  above  discussion  addresses  the  division  of  a 
problem  into  tasks  with  interactions  defined.  The  mapping  of 
tasks  to  processors  was  intentionally  left  out.  If  the 
system  is  comprised  of  a  set  of  homogeneous  processing 
modules,  and  if  a  shared  memory  architecture  is  used,  a  task 
can  be  made  to  run  on  any  processor.  Interactions  are  made 
to  the  shared  memory,  not  to  specifically  addressed 
processors.  With  these  assumptions,  the  programmer  can 
design  the  tasks  and  interactions  without  any  knowledge  of 
the  number  of  processors  or  specific  assignment  of  tasks  to 
processors.  This  any-task-on-any-processor  scenario  eases 
the  software  development  for  the  multiprocessor  by  allowing 
the  programmer  to  concentrate  on  the  tasks  and  interactions 
required,  not  the  architecture  used. 

(5.3)  Analysis of Available Real Time Operating Systems
(RTOS)

Given  the  desire  to  use  off-the-shelf  components  where 
appropriate,  an  analysis  of  existing  RTOS's  [16]  was 
required.  Four  packages  developed  for  use  in  MC68000  systems 
were  chosen:  VRTX  from  Ready  Systems  [17],  MTOS68K  from 
Industrial  Programming  Inc.  [18],  RMS68K  from  Motorola  [19], 
and  PSOS/PRISM  from  Software  Components  Group  [20,21].  The 
most  interesting  result  of  the  study  was  that  the  user 
interfaces  in  these  packages  vary  little  from  one  to 
another. The main differences in the interfaces occur in the
communication and synchronization primitives. These
primitives will be discussed first; a discussion of the
common functionality follows.

Three  basic  data  structures  are  used.  The  semaphore  is 
probably  the  most  studied  and  written  about  construct  for 
mutual  exclusion  and  intertask  synchronization.  In  its 
simplest  form,  a  semaphore  is  a  "flag"  on  which  two 
operations  can  be  performed.  The  "P"  operation  will  allow 
processing  to  proceed  if  the  flag  is  in  the  "pass"  state, 
but  "stops"  processing  otherwise.  The  semaphore  can  be  used 
to protect a resource, or ensure mutual exclusion. One task
sets  the  flag  and  proceeds  to  use  the  resource.  Another  task 
which  attempts  the  "P"  operation  after  this  point  will 
encounter  the  "stop"  state,  and  as  such,  is  prevented  from 
proceeding  in  using  the  resource.  When  the  task  is  done  with 
the  resource,  the  second  operation,  "V",  is  used  to  set  the 
flag  back  to  the  "pass"  state  so  that  a  waiting  task  can  use 
the  resource.  Other  variants  of  semaphores  exist,  including 
counting  semaphores  which  are  used  to  manage  multiple 
resources.  [22,23]  provide  more  complete  descriptions  of 
semaphores  and  their  variants. 
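
As a minimal sketch of the binary semaphore just described, the following Python model treats the flag as a boolean. The class and method names are illustrative assumptions and are not taken from any of the packages reviewed. The test-and-set inside `p` is assumed to be indivisible; on real multiprocessor hardware that would require an atomic instruction or a bus lock.

```python
class BinarySemaphore:
    """A flag with a "pass" state (resource free) and a "stop" state."""

    def __init__(self):
        self.passing = True  # start in the "pass" state

    def p(self):
        """Try to proceed. Returns True and flips the flag to "stop"
        if it was in the "pass" state; otherwise returns False."""
        if self.passing:
            self.passing = False
            return True
        return False  # caller must wait and try again

    def v(self):
        """Release the resource by returning the flag to "pass"."""
        self.passing = True


sem = BinarySemaphore()
first = sem.p()    # first task acquires the resource
second = sem.p()   # second task meets the "stop" state
sem.v()            # first task is done with the resource
third = sem.p()    # a waiting task may now proceed
```

A counting semaphore would replace the boolean with an integer count of free resources.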

Event flags are groups of simple flags. Each flag
signals whether a particular event has or has not occurred.
Processes can wait on event flags much as they do on
semaphores, although operations such as "P" and "V" above are
not defined specifically for them. What characterizes event
flags is the ability to wait on multiple event flags
simultaneously, combined with "AND" and "OR" operations.
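
The AND/OR wait condition can be illustrated with a bit mask. This is only a sketch: the flag names and the function `flags_satisfied` are hypothetical, and none of the reviewed packages is claimed to use this interface.

```python
# Hypothetical event flags, one bit each.
SENSOR_READY = 0b001
TIMER_EXPIRED = 0b010
LINK_UP = 0b100


def flags_satisfied(current, wanted, mode):
    """Return True if a task waiting on `wanted` may proceed.

    mode "AND": every wanted flag must be set.
    mode "OR":  at least one wanted flag must be set.
    """
    if mode == "AND":
        return (current & wanted) == wanted
    if mode == "OR":
        return (current & wanted) != 0
    raise ValueError("mode must be 'AND' or 'OR'")


current = SENSOR_READY | LINK_UP  # two of the three events have occurred
```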

Mailboxes are a higher level construct. Although they
provide a wait-on capability like the structures
above, mailboxes also provide an inherent means to pass
data. A "post" operation to a mailbox is similar to the
semaphore's "V" operation, except that a pointer (or unit of
data) is stored as a part of the operation. A "pend"
operation corresponds to the "P" operation, with the
addition of receiving the previously stored pointer (or
data).
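
A single-slot mailbox can be sketched as follows. The one-message capacity, the refusal of a second post while the slot is full, and the `(ok, message)` return convention are assumptions of this sketch rather than the documented behavior of any package named here.

```python
class Mailbox:
    """Single-slot mailbox: "post" stores a message, "pend" takes it."""

    def __init__(self):
        self._full = False
        self._message = None

    def post(self, message):
        """Store a message (like "V" plus data). Returns False if the
        slot is still occupied."""
        if self._full:
            return False
        self._message = message
        self._full = True
        return True

    def pend(self):
        """Take the stored message (like "P" plus data). Returns
        (True, message) on success or (False, None) if nothing is posted,
        in which case the caller would wait."""
        if not self._full:
            return False, None
        message, self._message = self._message, None
        self._full = False
        return True, message


box = Mailbox()
empty_result = box.pend()           # nothing posted yet
posted = box.post("sensor frame")   # producer posts
rejected = box.post("second frame") # slot is still occupied
full_result = box.pend()            # consumer receives the message
```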

PSOS  and  MTOS  utilize  all  of  the  above  constructs;  and, 
as  such,  provide  the  most  flexibility  to  the  user.  VRTX  only 
provides  support  for  mailboxes.  The  argument  given  for  this 
is  that  the  functionality  of  event  flags  and  semaphores  can 
be  emulated  with  mailboxes  [24].  The  use  of  a  single 
structure  provides  a  common  interface  to  the  programmer. 
RMS68K  only  supports  semaphores. 

Most of the user interface functions are common to all
of the above packages, the specific operations on the data
structures above being the primary exception. Typical
functions include task creation, deletion, suspension, and
resumption; memory management; getting and setting the system
time; and modifying task priority. All use basically the same
task state model and utilize a real time clock or counter
which is accessible by the user.

The  most  important  characteristic  for  this  study, 
however,  is  the  support  for  multiple  processors.  RMS68K,  as 
described  in  the  literature,  does  not  address  this 
capability.  VRTX  does  not  support  multiprocessors  directly; 
however,  Ready  Systems  does  describe  a  way  to  utilize  their 
single  processor  product  on  multiple  processors  with  shared 
memory  and  interprocessor  interrupts  [25].  MTOS  supports 
multiprocessors,  but  the  method  is  for  systems  utilizing 
shared  memory  in  a  "tightly  coupled"  fashion.  The  multiple 
processors acquire their tasks from a single (central)
queue.  Software  Components  Group  uses  a  modular,  building 
block  approach  to  their  package.  PSOS  is  the  multitasking 
kernel  located  on  each  processor  in  the  system.  An 
additional  package,  PRISM,  is  added  to  provide 
multiprocessor  capability.  Like  MTOS,  PRISM  requires  a  bus 
based architecture and shared memory. Interprocessor
interrupts are desired; however, polling can be used if they
are unavailable. Tasks communicate by "datagrams" which are
queued up in each processor. Tasks need to have knowledge of
the destination task's processor identification number. This
requirement is actually transparent to the application
programmer through the use of a "global name server" which
converts a logical "name" to a physical address; however, it
appears that the "server" must be updated when tasks are
assigned to processors.

(5.4)  Analysis  of  the  Multiprocessor/Multitasking  Problem  in 
the  Control  Environment 

(5.4.1)  The  Multiple  Processor  Problem 

Of  the  above,  PSOS/PRISM  and  VRTX  appear  to  correspond 
the  best  to  the  model  required.  Interestingly,  both  sets  of 
literature  indicate  that  two  techniques  are  possible  to 
accomplish  the  multiple  processor  communication  and 
synchronization:  interprocessor  interrupt  and  polling 
[21,25].  To  explain  the  need  for  either  of  these  techniques, 
consider the following scenario. A task on processor n is
waiting for data from another task. Let us assume that this
task is resident on another processor m. If each processor
has its own set of task queues and the task on processor n
waits for the data (either by going to a wait queue or
polling), how does the task on processor m signal to the
other task that the data is ready? This situation is
complicated  further  if  the  task  on  n  is  put  to  "sleep" 
waiting  for  the  data,  allowing  another  task  to  use  the 
processor.  In  the  single  processor  multitasking  model,  the 
process  is  accomplished  via  an  operation,  such  as  "V"  above, 
that  will  free  the  waiting  task  from  the  wait  queue  once  the 
data  is  available.  Across  multiple  processors,  however,  one 
processor  should  not  have  access  (modification  rights)  to 
another's  task  queues.  Some  sort  of  signal,  such  as  an 
interrupt,  is  required  for  one  local  kernel  to  indicate  to 
another  that  this  operation  needs  to  be  performed  and  which 
task  is  to  be  signalled. 

Interprocessor  interrupts  provide  the  most  direct  way 
to  accomplish  this.  An  interrupt  service  routine  on  each 
processor  can  be  set  up  to  act  much  like  a  local  "V" 
operation. Limitations exist, however. The interrupt scheme
in hardware must be any-to-any in order to support the
any-task-on-any-processor scenario. Also, knowledge of a
"consumer" task's processor must be available in order to
activate the proper interrupt. Interrupting all processors
in a broadcast manner is wasteful and inefficient. The
knowledge that is required does not rule out
any-task-on-any-processor, but does create a need for a
method to tag tasks with a processor identification number
when they are activated on that processor, and to make this
available as part of the communications/waiting process.


Polling,  as  the  alternative  to  interrupts,  is  often 
discarded  quickly  as  being  inefficient.  A  task  that  is 
waiting  for  data,  and  stays  on  the  processor  or  ready  task 
queue  while  waiting  for  a  data  available  indication,  would 
appear to waste precious processor time busy waiting. Also,
the context switch time in repeatedly bringing a polling task
onto a processor, then removing it when the data is
unavailable, can be significant. In short, the disadvantages
involve multiple passes through the processor while polling,
each with a costly context switch time. The advantages
include a much simpler means of implementing the
any-task-on-any-processor scenario, without any regard for
processor identification.

The two methods above assume a coordinated transfer of
data between tasks. Prescheduled tasks with timed transfers,
as discussed earlier in this chapter, are an example of an
exchange technique without coordination, or handshaking.
This technique requires that the programmer know, before
compilation, the execution times and task/processor placement.

(5.4.2)  Intertask  Communications 

Several  levels  of  coordinated  transfers  can  be  used. 
The  simplest  case  involves  a  basic  flag  which  signals  when 
data  is  ready.  This  unilateral  rendezvous  [24]  has  only  the 
consumer  of  the  data  wait  in  the  exchange.  Producers  can 
update  at  any  time  and  do  not  wait  for  the  consumer  to 
respond.  Bilateral  rendezvous  involve  both  producers  and 
consumers  waiting  until  both  are  ready.  The  bilateral 
transfer provides the most reliable exchange since it is
assumed that both sides of the transfer are active and meet
during  the  same  time  window.  Unilateral  transfers  require 
much  less  overhead  to  effect  the  exchange. 

(5.4.3)  Control  Timing 

Digital  real  time  control  systems  usually  are 
characterized  by  two  types  of  processing:  interrupt  driven, 
asynchronous  response  to  external  events  and/or  periodic 
sample/compute/output  sequences.  The  latter  case  is  of 
primary  concern  in  the  case  of  the  URV  system.  The  resultant 
requirement  is  for  a  means  to  schedule  tasks  on  a  strict 
periodic  basis.  Two  methods  have  been  identified  to  handle 
this  situation.  The  assumptions  made  from  the  previous 
discussions are that tasks are coordinated in a
rendezvous-like fashion and that a separate I/O section exists to
provide  the  tasks  with  input  data  and  to  take  task  outputs 
to  convert  to  servo  actuator  signals. 

In the first case, tasks can be ordered in a dataflow-like
manner in which the output of one task is required
before  another  can  start.  The  processing  of  a  task, 
therefore,  is  characterized  by  one  or  more  sequences  of: 
receive  inputs,  compute,  send  outputs.  The  I/O  section  can 
be  programmed  to  handle  interrupts  or  loop  in  a  timed  manner 
to  sample  data.  The  I/O  section  then  makes  a  rendezvous  with 
the  first  computational  section  task  (or  set  of  tasks)  in 
the  control  law  computations.  This  task  then  "fires"  other 
tasks,  and  so  on.  When  the  tasks  complete,  they  go  into  a 
wait  loop,  awaiting  the  next  rendezvous  to  start  again. 

An obvious problem with this case is that tasks are
spending precious processor time waiting for a rendezvous (a
polling type of situation) in order to start the next
computation. A better method is to utilize a special wait
queue  to  hold  tasks  for  a  specified  period  of  time.  This 
method  requires  a  timer  or  counter  function  to  be  available 
in  the  kernel.  The  kernel  monitors  the  queue  and  the  timer 
to  determine  when  tasks  are  to  be  removed.  Task  periodicity 
is  then  accomplished  by  placing  the  tasks  on  the  queue  once 
their  computation  is  complete.  Once  there,  they  are 
suspended  for  a  specified  period  of  time  (the  period  of 
control  computation)  until  the  kernel  removes  them.  The 
dataflow-type  rendezvous  method  of  the  first  case  can  then 
be used to ensure proper task sequencing.

(5.5)  URV Real Time Multiprocessor/multitasking Operating
System (RTMOS) Kernel Specification

Although  PSOS  and  VRTX  may  have  been  suitable  for  use 
in  the  URV  multiprocessor  system,  several  factors  led  to  the 
decision  to  design  and  develop  a  multitasking  kernel  instead 
of  purchasing  one.  Cost  was  one  such  factor.  [16]  gives 
typical  price  information.  Note  that  source  code,  far  more 
costly  than  object  code,  is  required  if  modifications  are 
needed.  The  most  critical  factors  were  the  learning  curve 
and  porting-to-target  times.  Unlike  an  off-the-shelf 
processor  board,  a  commercial  RTOS  is  not  usable  when  it 
arrives.  The  software  must  be  ported  to  the  target  processor 
system  by  providing  the  software  "hooks"  between  the 
provided  code  and  target  resources,  such  as  RAM  and  timers. 
The code may also require transfer to new ROM chips to
accommodate the target board design. The time taken to port
the new code and learn how to interface to it is
significant, much more than setting up an off-the-shelf
processor board. Because of these factors, a short study
investigated those features needed in the URV kernel. The
results of the study were that the functions required were
fewer in number than those of commercial RTOS's and not
difficult to implement, and that an in-house developed kernel
would provide the flexibility and ease of use required. Upon
development of the kernel features, as discussed below, a
further discovery was made that performance improvements
could be realized by tailoring the kernel to the application
and target hardware. The output of the resultant development
effort was the URV Real-Time Multiprocessor/multitasking
Operating System (RTMOS), a kernel which operates
identically on each processor in the system with the added
capabilities of multiprocessor interaction and
synchronization.


(5.5.1)  Kernel  Structures 


This section describes the resultant kernel and
intertask communications specification for the URV RTMOS. To
begin, let us review the underlying hardware structure. The
computational section is comprised of a set of MC68000
processor boards interconnected by a VME backplane bus. The
basic means of interprocessor communication is shared memory
comprised of the uniquely addressable dual port memory
segments on each processor board. An interprocessor
interrupt capability is specified for VME and can be
implemented as any-to-any. However, the interrupt scheme
requires knowledge (processor number or address) of the
destination of the signal. On each board is a resident
MC68681 DUART chip to be used for serial communication
between the processor board and an external terminal. This
chip also contains internal timers.

As discussed earlier, the single processor multitasking
kernel model is used as a baseline. This means that each
processor has its own set of task queues. The queues are
loaded with some subset of the total set of tasks. Any task
can operate on any processor without modification or
precompilation knowledge of its own or any other task's
processor. This allows the set of tasks to operate on one or
any number of processors by simply loading the task queues
of each available processor with some subset.
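
To illustrate how one unmodified task set can be spread over one or several processors, the sketch below loads per-processor queues in round-robin order. The round-robin rule, the function name `load_queues`, and the task names are assumptions made for illustration; the text only requires that each queue receive some subset of the tasks.

```python
def load_queues(tasks, processor_count):
    """Distribute tasks over per-processor queues, round-robin.

    Returns one list per processor; every task appears exactly once.
    """
    if processor_count < 1:
        raise ValueError("at least one processor is required")
    queues = [[] for _ in range(processor_count)]
    for index, task in enumerate(tasks):
        queues[index % processor_count].append(task)
    return queues


tasks = ["io", "control_law", "mixer", "logger"]  # hypothetical task names
one_cpu = load_queues(tasks, 1)     # the whole set on a single processor
three_cpus = load_queues(tasks, 3)  # the same set on three processors
```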

In  the  single  processor  multitasking  model,  a  task 
ready  queue  is  used  to  hold  and  order  tasks  awaiting 
processor  execution  time.  Priorities  are  generally 
associated  with  tasks  so  that  ordering  of  the  queue  will 
reflect  the  relative  importance  or  any  time  constraints  of 
the  tasks  needing  processor  time.  The  URV  RTMOS  kernel 
builds  upon  this  basic  structure. 

In  addition  to  the  ready  queue,  a  timer  queue  is  used. 
This  queue,  described  in  the  previous  section,  is  used  to 
hold  tasks  for  a  specified  period  of  time.  The  queue  is 
ordered  by  release  time  rather  than  by  priority.  This 
ordering  allows  easy  scanning  for  task  release  time  since 
the  search  proceeds  only  from  the  front  entry  to  the  first 
which  is  not  ready  to  be  released. 
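
The release scan over a queue ordered by release time can be sketched as below. The class name, the use of a sorted Python list, and the tie-breaking sequence number are assumptions of this sketch; the kernel itself would use linked task control blocks.

```python
import bisect


class TimerQueue:
    """Holds tasks until a release tick; kept sorted by release time."""

    def __init__(self):
        self._entries = []   # (release_tick, sequence, task), ascending
        self._sequence = 0   # tie-breaker so equal ticks keep arrival order

    def hold(self, task, release_tick):
        """Insert a task at its place in release-time order."""
        bisect.insort(self._entries, (release_tick, self._sequence, task))
        self._sequence += 1

    def release_due(self, now):
        """Remove and return tasks whose release tick has arrived.

        The scan starts at the front and stops at the first entry
        that is not yet due, so later entries are never examined.
        """
        released = []
        while self._entries and self._entries[0][0] <= now:
            released.append(self._entries.pop(0)[2])
        return released


queue = TimerQueue()
queue.hold("mixer", 10)
queue.hold("sampler", 5)
queue.hold("logger", 20)
early = queue.release_due(4)     # nothing is due yet
middle = queue.release_due(10)   # sampler, then mixer
late = queue.release_due(25)     # logger
```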

The use of a timer queue requires a timer or clock
function to be implemented within the kernel. The DUART chip
mentioned above allows for this to be added. The chip can be
programmed to interrupt the processor at periodic intervals.
In  the  URV  RTMOS  kernel,  this  interval  was  chosen  to  be  one 
millisecond  so  that  control  loop  or  task  frequencies  of  up 
to  1000  Hz  can  be  implemented.  The  interrupt  routine, 
referred  to  as  the  tick  function,  updates  a  location  in 
memory  as  a  counter.  This  location  is  used  as  a  clock 
relative  to  system  start-up.  All  task  timings  are  based  upon 
this  clock. 
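
As a worked example of the one millisecond tick, the sketch below converts a task frequency into a period in ticks and counts ticks from start-up. The names are hypothetical, and rejecting frequencies that do not divide the tick rate evenly is a choice made for this sketch, not something the text specifies.

```python
TICKS_PER_SECOND = 1000  # one tick per millisecond, as chosen in the text


class SystemClock:
    """Counter incremented by the periodic interrupt (the tick function)."""

    def __init__(self):
        self.ticks = 0  # time relative to system start-up

    def tick(self):
        self.ticks += 1


def period_in_ticks(frequency_hz):
    """Number of ticks between releases of a task run at frequency_hz."""
    if frequency_hz <= 0 or TICKS_PER_SECOND % frequency_hz != 0:
        raise ValueError("frequency must divide the tick rate evenly")
    return TICKS_PER_SECOND // frequency_hz


clock = SystemClock()
period = period_in_ticks(50)   # a 50 Hz task
releases = []
for _ in range(40):            # simulate 40 ms of interrupts
    clock.tick()
    if clock.ticks % period == 0:
        releases.append(clock.ticks)
```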

A  diagram  of  the  queue  structure  of  a  single  processor 
is  given  in  Figure  5.1. 

(5.5.2)  Task  Data  Exchanges  and  Context  Switches 

Task  data  exchanges  are  handled  via  a  variant  of  a 
unilateral  rendezvous.  The  data  structure  used  in  the 
exchange  is  based  upon  the  mailbox  concept  described  above. 
The  structure  is  comprised  of  four  parts.  The  first  part  is 
a  simple  semaphore  which  is  used  to  ensure  mutual  exclusion 
over  the  rest  of  the  construct.  The  second  part  is  a  flag 
which  enforces  the  producer  and  consumer  relationship  of  the 
rendezvous.  The  third  variable  is  a  count  of  data  words  to 
follow  in  the  structure.  As  such,  the  data  being  passed  in 
the  exchange  comprises  the  fourth  part. 
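
The four-part structure can be pictured as consecutive words in shared memory. In the sketch below the word offsets, the initial values, and the helper names are all assumptions; only the ordering of the four parts (semaphore, flag, count, data) comes from the description above.

```python
# Assumed word offsets within the exchange structure.
SEM, FLAG, COUNT, DATA = 0, 1, 2, 3


def make_exchange(capacity):
    """Build an exchange with room for `capacity` data words.
    Initial values (semaphore 1 = "pass", flag 0, count 0) are placeholders."""
    return [1, 0, 0] + [0] * capacity


def put_words(exchange, words):
    """Store the data words and record how many follow the count."""
    if len(words) > len(exchange) - DATA:
        raise ValueError("too many words for this exchange")
    exchange[COUNT] = len(words)
    exchange[DATA:DATA + len(words)] = words


def get_words(exchange):
    """Read back exactly `count` data words."""
    return exchange[DATA:DATA + exchange[COUNT]]


link = make_exchange(4)
put_words(link, [7, 8, 9])
```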

Like  the  unilateral  rendezvous  described  previously, 
the  URV  RTMOS  utilizes  exchanges  where  only  consumers  of 
data  wait.  Before  waiting,  a  timeout  clock  value  is  recorded 
so  that  the  wait  time  is  bounded.  The  difference  from  the 
basic  unilateral  transfer  comes  in  the  use  of  the  second 
piece  of  the  data  structure,  the  producer/consumer  flag. 
This flag has three states: producer's turn, consumer's
turn, and failed exchange. The first two states are
self-explanatory. The last is used in the event of a timeout or
some  other  failure  event.  If  the  consumer  detects  a  timeout, 
it  can  signal  to  the  producer,  through  the  flag,  that  data 
was  not  received  within  the  prescribed  time  window.  In  this 
way,  late  producers  or  late  consumers  are  detected  and 
transfers  cease  until  corrective  measures  are  taken  to 
"resynchronize"  the  tasks.  This  process  will  be  addressed 
later. 

As  mentioned  previously,  polling  is  often  considered  to 
be  an  undesirable  option  in  multitasking  systems.  The 
reasons  for  this  include  costly  context  switching  times  and 
"clogging"  of  the  ready  queue.  Generally,  a  context  switch 
involves  a  switch  to  the  executive  (via  a  jump,  subroutine 
call,  or  software  interrupt),  a  saving  of  the  state  of  the 
current  running  task  (its  context),  the  removal  and  storage 
elsewhere  of  this  task,  and  the  acquisition  of  the  next 
ready  task.  This  new  task's  state  is  then  restored  and 
control  transferred  back  to  the  user  task  mode  of  operation. 
In terms of the MC68000, which of these components is the
most costly? Queue management, for simple queues such as
described above, does not require extensive processor time.
Mode changing and state saving/retrieving can, however. The
commercial packages described earlier change from user to
executive mode with a TRAP (or software interrupt)
instruction. The cost is in the partial storage of the
processor state taken by the MC68000 upon interrupt, and
this process is performed each time a system function is
called.

The task state storage/retrieval is the main concern in
context  switches.  The  MC68000  has  sixteen  32-bit  internal 
registers.  Since  the  kernel  has  no  means  of  determining 
which  registers  the  exiting  task  is  using,  it  must  save  all 
of them. This makes polling undesirable since one failed
polling attempt results in at least 32 words of data being
stored, then later retrieved.

Two  flaws  exist  in  the  techniques  described  above. 
First,  a  TRAP  should  not  be  performed  each  time  a  data 
exchange  is  to  be  made.  The  purpose  for  the  TRAP  is  to  force 
the  processing  into  the  kernel  for  proper  and  hidden  access 
to  RTMOS  resources  such  as  the  task  queues.  Having  the 
kernel  also  preside  over  data  exchanges  forces  unnecessary 
overhead.  Other  options  exist  to  hide  data  structure  details 
from  the  applications  programmer.  These  will  be  discussed 
later. 

The  other  flaw  is  the  storage/retrieval  of  all 
registers  each  time  a  polling  iteration  occurs.  The  context 
need  only  be  stored  and  restored  once.  The  task's  registers 
can  be  saved  at  the  first  failed  poll  and  be  retrieved  only 
after  the  poll  is  successful.  In  between,  the  registers  are 
not  used.  Also,  since  the  kernel  is  unable  to  determine  on 
its  own  which  registers  are  being  used,  a  less  brute  force 
solution  is  to  have  the  task  specify  the  registers  in  use; 
or  better  yet,  make  the  task  save  its  own  registers.  This 
option  is  possible  if  a  task's  stack  pointer  points  to 
within its own Process Control Block (PCB), a data structure
used  by  the  kernel  for  task  information  storage  and  queue 
linkage.  Again,  this  process  can  be  made  transparent  to  the 
applications programmer without putting the burden, and the
overhead,  into  the  kernel. 
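
A rough count shows what the two changes save. Sixteen 32-bit registers occupy 32 sixteen-bit words, the figure quoted above. The sketch compares saving every register on every failed poll with saving only the registers a task declares, once. The function names and the example of five registers in use are hypothetical.

```python
WORDS_PER_REGISTER = 2   # a 32-bit register is two 16-bit words
ALL_REGISTERS = 16


def words_saved_every_poll(failed_polls):
    """Words stored when all registers are saved on each failed poll."""
    return failed_polls * ALL_REGISTERS * WORDS_PER_REGISTER


def words_saved_once(registers_in_use, failed_polls):
    """Words stored when only the declared registers are saved, and only
    at the first failed poll."""
    if failed_polls == 0:
        return 0
    return registers_in_use * WORDS_PER_REGISTER


brute_force = words_saved_every_poll(4)   # four failed polls
tailored = words_saved_once(5, 4)         # five registers in use
```

The same number of words is retrieved again in each case, so the totals moved are twice these figures.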

As can be seen from the discussion above, the
functionality of the kernel has been limited basically to
the management of the multitasking queues (ready, timer) and
the handling of interrupts. How do we keep the burden of
underlying communications data structures and polling from
the applications programmer without placing it within the
kernel? In the case of the URV RTMOS, where assembly
language has been used in the initial stages of code
development, macros are one possibility. Macros provide a
generic unit of code with "gaps" in which specific
instantiation information can be placed. In place of a
section of code which implements a polling sequence on a
particular data structure, a macro instruction can be used.
Parameters in the polling macro instruction specify the data
structure name (label), the registers in use, and the source
or destination of exchange data. Macro instruction libraries
can be supplied to applications programmers as easily as
system functions implemented as TRAPs. In effect, macros
are similar to additional, higher level instructions
provided for use by the applications programmer. In higher
order language implementations, special procedure calls can
be utilized for the same purpose.

To demonstrate, the macro instructions for the producer
and consumer polling will be discussed, first in algorithmic
form.

Producer:

    P[var(sem)]
    If var(P/C) <> C then
        Report Error to System
        var(P/C) = C
    Else
        Put var(data)
        var(P/C) = P
    End if
    V[var(sem)]

Consumer:

    P[var(sem)]
    Save registers
    While (var(P/C) = C) and (not timeout) do
        V[var(sem)]
        Wait
        P[var(sem)]
    End while
    If var(P/C) = P then
        var(P/C) = C
    Else
        var(P/C) = F
        Report Error to System
    End if
    Get var(data)
    Retrieve registers
    V[var(sem)]

Note that the algorithms handle the mutual exclusion over
the data structure, the storage/retrieval of specified
registers, the checking/updating of the producer/consumer
flag, the monitoring of exchange timeouts, and the transfer
of data to and from the exchange data structure. Only the
"Wait" operation is an actual call to the kernel. The rest
of the algorithm is actually performed in user
(applications) mode, but is hidden in source code from the
applications programmer by the use of macro instructions.
The programmer would implement a consumer poll with the
macro instruction c.poll, as in the following example:


    c.poll  tasklink,d0-d3/a5,mydata

where  tasklink  =  exchange data structure
       d0-d3/a5  =  specifies that d0, d1, d2, d3, a5 are
                    to be saved
       mydata    =  destination of data from exchange


The source code for c.poll and the corresponding p.poll macro
is included in Appendix (A1).
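
The producer and consumer algorithms can also be followed in a small single-threaded Python simulation. This is a sketch under stated assumptions and not the macro source: the semaphore P/V and the register save/restore are omitted because nothing runs concurrently here, `wait` is a stand-in for the kernel "Wait" call, the initial flag value is assumed to be C, and the consumer returns `None` on a failed exchange as a convenience of this sketch.

```python
PRODUCED, CONSUMED, FAILED = "P", "C", "F"  # flag letters from the algorithm


class Exchange:
    def __init__(self):
        self.flag = CONSUMED  # assumed start: no unread data in the structure
        self.data = None


def p_poll(ex, value, report_error):
    """Producer: deposit data only if the previous data was consumed."""
    if ex.flag != CONSUMED:
        report_error("producer: exchange not ready")
        ex.flag = CONSUMED       # reset so later exchanges can resume
    else:
        ex.data = value
        ex.flag = PRODUCED


def c_poll(ex, now, deadline, wait, report_error):
    """Consumer: poll until data arrives or the timeout passes."""
    while ex.flag == CONSUMED and now() < deadline:
        wait()                   # the only step that would enter the kernel
    if ex.flag == PRODUCED:
        ex.flag = CONSUMED
        return ex.data
    ex.flag = FAILED
    report_error("consumer: no data within time window")
    return None


errors = []
clock = [0]
ex = Exchange()

def wait_then_produce():
    clock[0] += 1
    if clock[0] == 3:            # the producer runs on the third tick
        p_poll(ex, [1, 2], errors.append)

received = c_poll(ex, lambda: clock[0], 10, wait_then_produce, errors.append)

late = Exchange()

def wait_only():
    clock[0] += 1                # no producer ever runs

missed = c_poll(late, lambda: clock[0], clock[0] + 5, wait_only, errors.append)
p_poll(late, [3], errors.append)  # late producer sees F, reports, resets to C
```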


The question remains whether polling is a viable
option. Context switch times have been reduced (as will be
discussed in a later chapter). Kernel calls have been
reduced in number and have been made simpler. The
implementation  of  any-task-on-any-processor  is  also 
simplified.  Still,  the  potential  problem  of  polling  numerous 
times  before  succeeding  is  troublesome.  The  answer  may  lie 
in  the  proper  use  of  the  timer  queue. 

The  situation  to  avoid  when  polling  is  to  have  a 
consumer of data begin waiting long before the producer is
able  to  have  it  available.  In  other  words,  the  desire  is  for 
a  means  to  appropriately  "order"  producers  and  consumers. 
The  extreme  of  having  the  applications  programmer  time 
schedule  tasks  in  advance  has  been  previously  discarded. 
However,  the  ability  exists  to  stagger  the  release  times  of 
tasks from the timer queue such that tasks at the
"front"  of  a  thread  of  computation  are  released  before  those 
at  the  "end".  This  staggering  need  not  be  precise;  in  fact, 
the  indeterminacy  of  the  queues'  orderings  makes  this 
extremely  difficult,  if  not  impossible.  However,  the  desire 
is  only  to  reduce  polling  to  one  or  a  couple  of  tries.  With 
the  timer  queue  described  above,  this  situation  does  not 
appear  difficult  to  realize.  Experience  in  actual  usage  is 
required  to  demonstrate  the  practicality  of  polling  with 
timer  queues. 
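The staggering idea can be illustrated with a small model. The sketch below is an assumption-laden Python illustration, not RTMOS code: the function names, the fixed stagger interval, and the use of a min-heap as the timer queue are all hypothetical.

```python
import heapq


def stagger_release_times(thread_tasks, base_time, stagger):
    """Assign later release times to tasks further down a thread of computation.

    thread_tasks is ordered from the "front" of the thread (first producer)
    to the "end" (last consumer); stagger is the gap, in ticks, between stages.
    """
    return [(base_time + i * stagger, name) for i, name in enumerate(thread_tasks)]


def release_order(timer_entries):
    """Empty a timer queue (modeled as a min-heap keyed on release time)."""
    heap = list(timer_entries)
    heapq.heapify(heap)
    order = []
    while heap:
        _, name = heapq.heappop(heap)
        order.append(name)
    return order


# Hypothetical three-stage thread: each stage consumes the previous stage's data.
entries = stagger_release_times(["sample", "compute", "output"], base_time=100, stagger=2)
```

Whatever order the entries are loaded in, producers leave the timer queue ahead of their consumers, so a consumer's first poll is more likely to succeed.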

Another  feature  has  been  built  into  the  RTMOS  to  limit 
the  burden  on  the  applications  programmer.  To  prevent  the 
programmer  from  having  to  design  in  points  to  release  the 
processor  to  other  tasks  in  order  to  create  fairness, 
processor  time  slicing  was  added.  This  feature  will 
automatically  switch  out  tasks  which  hold  a  processor  for  an
extended  period  of  time.  In  this  way,  long  running  tasks, 
such  as  those  with  multiple  nested  loops,  can  be  run  without 
"starving"  other  tasks  on  the  processor.  Not  all  tasks  or 
sections  of  code  within  tasks  should  be  time  sliced, 
however.  An  example  is  the  polling  macros  above.  If  sliced 
during  one  of  these  macros,  a  task  will  hold  "ownership"  of 
the  semaphore,  and  likewise  the  variable,  while  the  task  is 
reawaiting  processor  time  in  the  ready  queue.  To  prevent 
this  and  other  similar  situations,  each  task  PCB  contains  a 
time  slice  inhibit  flag.  System  functions  which  turn  on  and 
turn  off  the  time  slicing  privilege  are  used  around
sections  of  code  where  a  slice  can  cause  problems  to  occur. 
For  example,  these  functions  have  been  incorporated  into  the 
macro  instructions  previously  described.  The  flag  can  also 
be  put  in  the  inhibit  state  at  initialization  and  left 
untouched.  This  feature  permits  an  "unsliceable"  task. 
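A minimal sketch of how the inhibit flag could gate a time slice is given below. It is a Python illustration only; the class name, the tick-driven check, and the quantum parameter are assumptions rather than details of the RTMOS source.

```python
from dataclasses import dataclass


@dataclass
class SlicedTask:
    """Hypothetical subset of a task PCB relevant to time slicing."""
    name: str
    slice_inhibit: bool = False  # time slice inhibit flag
    ticks_held: int = 0          # ticks since the task was given the processor


def slice_off(task: SlicedTask) -> None:
    """Model of the slice-off call: forbid slicing (e.g. inside a polling macro)."""
    task.slice_inhibit = True


def slice_on(task: SlicedTask) -> None:
    """Model of the slice-on call: allow slicing again."""
    task.slice_inhibit = False


def tick(task: SlicedTask, quantum: int) -> bool:
    """Called once per timer tick; True means the task should be switched out."""
    task.ticks_held += 1
    return task.ticks_held >= quantum and not task.slice_inhibit
```

In this model an expired quantum is simply deferred until the flag is cleared, so a task is never switched out while it owns a semaphore.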

The  resulting  user  interface  to  the  kernel  is  comprised 
of  seven  system  calls:  release  (or  exit),  release-on-poll, 
sleep,  terminate,  report-to-system,  slice-on,  and  slice-off. 
Release  and  release-on-poll  cause  the  current  running  task 
to  be  removed  from  the  processor  and  placed  at  the  end  of 
the  ready  queue.  The  next  ready  task  is  then  assigned  the 
processor.  In  release-on-poll,  additional  functions  are 
performed  particular  to  a  polling  task.  Sleep  does  the  same 
as  release  except  that  the  previously  running  task  is  placed 
into  the  timer  queue  rather  than  the  ready  queue. 

Terminate  does  not  place  the  previous  task  on  any 
queue;  as  a  result,  the  task  effectively  "dies"  rather  than 
is  suspended,  where  reactivation  is  assumed.  This  function
is  used  for  nonrecurring  tasks  which  are  started  up  at 
initialization  or  by  the  RTMOS  at  certain  times  for  single 
run  operation.  One  case  where  this  may  occur  is  when  a  task 
makes  a  report-to-system  system  call.  The  report-to-system 
call  is  used  to  report  error  conditions,  such  as  timeouts, 
to  the  RTMOS  so  that  corrective  measures  can  be  taken.  An 
example  of  a  single  run  task  which  may  be  started  upon  a 
report  of  timeout  is  one  which  will  sample  the  individual 
clocks  of  each  processor  of  the  system  to  detect  any 
significant  disagreement.  This  task  may  in  turn  report  the 
findings  to  the  system  for  further  corrective  actions. 
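The effect of the release, sleep, and terminate calls on the queues can be summarized in a toy model. The Python sketch below is an assumption for illustration: the Kernel class, its method names, and the list-based timer queue are hypothetical, and release-on-poll, report-to-system, and the slice calls are not modeled.

```python
from collections import deque


class Kernel:
    """Toy single-processor model of the ready and timer queues (names assumed)."""

    def __init__(self, tasks):
        self.ready = deque(tasks)   # ready queue, first in, first out
        self.timer = []             # timer queue: (wake_time, task) pairs
        self.running = self.ready.popleft()

    def _dispatch(self):
        # Assign the processor to the next ready task, if any.
        self.running = self.ready.popleft() if self.ready else None

    def release(self):
        """Running task goes to the end of the ready queue."""
        self.ready.append(self.running)
        self._dispatch()

    def sleep(self, wake_time):
        """Running task goes to the timer queue instead of the ready queue."""
        self.timer.append((wake_time, self.running))
        self._dispatch()

    def terminate(self):
        """Running task is placed on no queue at all, so it never runs again."""
        self._dispatch()

    def tick(self, now):
        """Move tasks whose wake time has arrived to the ready queue."""
        due = sorted(entry for entry in self.timer if entry[0] <= now)
        self.timer = [entry for entry in self.timer if entry[0] > now]
        for _, task in due:
            self.ready.append(task)
        if self.running is None:
            self._dispatch()
```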

The  overall  structure  of  the  kernel  has  now  been 
presented.  A  corresponding  task  state  transition  diagram  is 
given  in  Figure  5.2.  Note  that  the  states  of  the  URV  RTMOS 
tasks  resemble  those  available  in  the  commercial  packages 
surveyed  above. 

One  final  topic  in  the  context  of  the  kernel  merits 
discussion.  Task  code  is  static,  ROMable,  and  reentrant.  To 
activate  a  task  or  instantiate  multiple  copies  of  a  task, 
task  PCB's  are  used.  These  structures  form  the  run-time 
representation  of  tasks.  A  task  PCB  contains  a  task's 
context  when  it  is  not  running,  its  stack,  queue  linkage 
pointers,  status  flags,  timeout  information,  and  a  pointer 
to  the  task  code.  To  activate  a  task,  a  PCB  is  loaded  with 
the  appropriate  initial  task  data  and  linked  to  the 
appropriate  queue.  To  activate  multiple  tasks,  multiple 
PCB's  pointing  to  the  same  task  code  unit  are  created.  Task 
code  is  either  structured  in  an  infinite  loop,  or  contains  a 
terminate  kernel  call.  Appropriate  processor  release  calls 


are  either  programmed  in;  or  time  slicing  is  enabled, 
allowing  the  kernel  to  automatically  remove  the  task  at 
periodic  intervals.  Examples  of  tasks  can  be  found  in  the 
source  code  listing  in  Appendix  (A3).  Their  associated  PCB's
are  given  with  the  source  code  for  RTMOS  in  Appendix  (A2).
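The relationship between static task code and its PCB's might be pictured as follows. This Python sketch is illustrative only: the field names are assumptions loosely based on the list above, and the actual PCB layout is the one given in Appendix (A2).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PCB:
    """Run-time representation of one task activation (field names assumed)."""
    code: Callable[..., None]                      # pointer to the shared task code
    context: dict = field(default_factory=dict)    # saved registers when not running
    stack: List[int] = field(default_factory=list)
    next_pcb: Optional["PCB"] = None               # queue linkage pointer
    slice_inhibit: bool = False                    # status flag
    timeout: int = 0                               # timeout information, in ticks


def fill_task(pcb: PCB) -> None:
    """Stand-in for reentrant task code: all per-activation state is in the PCB."""
    pcb.stack.append(pcb.context["d0"])


# Two activations of the same code unit, each loaded with its own initial data.
first = PCB(code=fill_task, context={"d0": 1})
second = PCB(code=fill_task, context={"d0": 2})
first.next_pcb = second                            # link both into one queue

for pcb in (first, second):
    pcb.code(pcb)
```

Because the code unit keeps no state of its own, the two activations do not interfere with each other even though they share one copy of the code.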

(5.6)  Remaining  RTMOS  Features 

The  kernel  is  just  one  part  of  the  URV  RTMOS.  Figure 
5.3  shows  a  layered  representation  of  the  RTMOS,  its 
features,  and  its  relationship  to  applications  tasks.  The 
only  part  of  the  RTMOS  not  addressed  to  this  point  is  the 
system  tasks  which  it  employs.  System  tasks  are  different 
than  applications  tasks  in  that  they  are  basically  a  part  of 
the  RTMOS,  are  always  resident  in  each  processor,  and  may 
have  knowledge  of  their  own  processor.  The  reason  why  these 
are  discussed  last  is  that  only  a  couple  of  system  tasks 
have  been  designed  and  used  in  the  URV  multiprocessor 
avionics  prototype.  The  first  system  task  reports  system 
errors  to  a  terminal  for  diagnostic  purposes.  This  task  is 
initiated  by  the  report-to-system  system  call.  Because  of 
the  simplistic  nature  of  the  task,  it  will  not  be  discussed 
in  further  detail. 

Another  system  task  implemented  in  the  prototype 
provides  an  important  interprocessor  synchronization 
function.  Without  this  capability,  the  individual  processor 
clocks  can  skew.  Since  these  clocks  determine  task  release 
times  from  timer  queues,  skewing  can  create 
producer/consumer  timing  problems,  including  exchange 
timeouts.  To  prevent  this,  a  periodic  system  task  is 
assigned  to  each  processor  to  resynchronize  the  clocks  of
the  system.  For  the  purposes  of  the  prototype,  a  simple 
algorithm  is  used.  In  this  algorithm,  each  processor 
(actually  the  synchronization  task  operating  on  the 
processor)  waits  until  all  processors  report  to  a  table  in 
shared  memory,  at  which  time  the  clocks  are  set  identically. 
All  of  these  synchronization  tasks  have  the  same  timer  queue 
release  times  and  are  the  first  tasks  in  each  processor  to 
be  performed.  If  the  tasks  are  scheduled  appropriately, 
skewing  will  be  controlled  by  reorienting  the  clocks  at 
periodic  intervals. 
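The simple scheme can be modeled in a few lines. The Python sketch below is an assumption for illustration: the function name is hypothetical, and the text does not say which value the clocks are set to, so the largest clock is used here arbitrarily.

```python
def resynchronize(clocks, report_order):
    """Model of the simple scheme: wait for every processor, then set all clocks alike.

    clocks maps processor id -> local tick count. report_order lists the
    processors in the order their synchronization tasks reach the table.
    Returns the new clocks, or None if some processor never reports, which
    in the real system would leave the other tasks waiting indefinitely.
    """
    table = {}                               # report table in shared memory
    for cpu in report_order:
        table[cpu] = True
    if set(table) != set(clocks):
        return None
    common = max(clocks.values())            # assumed choice of the common value
    return {cpu: common for cpu in clocks}
```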

As  is  often  the  case,  the  simple  solution  is  not  the 
best.  The  above  algorithm  has  a  weakness  in  that  the 
operation  of  the  system  is  delayed  until  synchronization  is 
completed.  Also,  the  above  algorithm  does  not  account  for 
the  possibility  of  failed  processors;  a  situation  which  will 
deadlock  the  system.  A  better  algorithm  would  have  each 
processor  maintain  its  clock  value  or  a  copy  of  its  clock 
value  in  its  own  section  of  the  distributed  shared  memory. 
Resynchronizing  a  processor's  clock  would  be  the  job  of  that 
processor  alone,  with  no  interaction  with  or  waiting  for 
another  processor.  The  resynchronization  would  involve 
looking  at  all  the  clocks  and  resetting  the  local  clock 
according  to  a  consensus  algorithm.  This  method  is  more 
complicated  than  it  at  first  seems,  and  more  work  remains  in
this  area.  However,  the  original  algorithm  given  has
sufficed  for  the  first  phase  prototype.
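One possible form of such a consensus step is sketched below in Python. It is not the algorithm of the thesis, which leaves the choice open. The median rule, the max_skew threshold, and the function name are all assumptions made for illustration.

```python
from statistics import median_low


def resync_local_clock(published_clocks, max_skew):
    """Pick a new local clock value from the clocks published in shared memory.

    published_clocks maps processor id -> last published clock value.
    Clocks further than max_skew from the median are treated as belonging
    to failed processors and ignored. No processor waits on any other.
    """
    values = list(published_clocks.values())
    mid = median_low(values)
    trusted = [v for v in values if abs(v - mid) <= max_skew]
    return median_low(trusted)
```

For example, with published clocks of 1000, 1002, 1001, and 5 (a stalled processor) and a max_skew of 10, the stalled clock is discarded and the local clock is set to 1001.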

Other  potential  system  tasks,  not  designed  for  this 
first  phase  prototype,  warrant  discussion.  The  task 


assignment  for  the  current  prototype  is  static;  in  other 
words,  at  initialization,  the  tasks  are  assigned  to 
processors  and  loaded  into  queues.  This  assignment  is  never 
changed  throughout  the  runtime  of  the  system.  However,  a 
more  practical  scenario  would  have  the  system  configure  the
task  assignment  according  to  available  processing  resources 
and  reconfigure  the  tasks  in  the  event  of  a  detected 
failure.  The  processes  of  task  assignment,  failure 
detection,  and  reconfiguration,  therefore,  comprise  an 
important  part  of  the  system  task  area.  These  tasks  will  be 
discussed  in  the  following  paragraphs  in  terms  of  their 
theory  of  operation.  Other  system  tasks  are  possible,  but 
further  work  on  the  URV  RTMOS  is  needed  to  determine  the 
requirements.

Task  assignment  can  be  carried  out  in  many  ways.  One 
technique  which  has  been  proven  is  the  CRMMFCS  system  of 
self-checks  and  volunteering  [4].  Three  basic  steps  are
involved.  Upon  notification  of  a  task  assignment  cycle,  a 
task  on  each  processor  performs  a  brief  self-check  on  its 
processor.  The  second  step  involves  reporting  the  results 
(health  status)  to  a  table  in  the  shared  memory.  The  third 
step,  which  takes  place  after  a  suitable  delay  to  let  all 
processors  report,  has  a  task  check  the  status  of  the  table, 
count  the  number  of  healthy  processors,  and  choose  a  set  of 
tasks.  Various  acceptable  techniques  exist  for  this  last 
step;  as  such,  no  further  discussion  is  required. 
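The three steps can be sketched as below. This Python model is an assumption for illustration and is not the CRMMFCS implementation: the self-check is a stub, the health table is an ordinary dictionary, and the round-robin choice in step three is only one of the acceptable techniques mentioned above.

```python
def self_check(cpu_id, failed):
    """Step 1: stand-in for a brief processor self-check."""
    return cpu_id not in failed


def assign_tasks(cpu_ids, failed, tasks):
    """Return a mapping of healthy processor id -> list of assigned tasks."""
    # Step 2: every processor reports its health status to a shared table.
    health_table = {cpu: self_check(cpu, failed) for cpu in cpu_ids}

    # Step 3: after the reporting delay, count the healthy processors and
    # choose a task set. Round-robin is used here purely for illustration.
    healthy = sorted(cpu for cpu, ok in health_table.items() if ok)
    if not healthy:
        return {}
    assignment = {cpu: [] for cpu in healthy}
    for i, task in enumerate(tasks):
        assignment[healthy[i % len(healthy)]].append(task)
    return assignment
```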

Failure  detection  tasks  can  take  many  forms  also.  Low 
priority  self-checks  can  be  used  for  "background"  built-in-test.
The  clock  checking  task  put  into  the  system  upon
timeout  (described  previously)  is  another  example.  These 


self-check  tasks  can  be  of  single  run  type,  timer  queue 
type,  or  low  priority  ready-queue-only  type.  The  results  of 
the  failure  detection  can  result  in  a  variety  of  responses. 
The  timeout  task  above  could  resynchronize  the  clocks  if  the 
current  clock  value  of  each  processor  is  kept  in  shared 
memory.  Perhaps  a  reconfiguration  could  be  triggered  by 
interrupts  or  special  command  words  monitored  regularly  by 
the  individual  kernels. 

Reconfiguration  itself  is  a  variant  on  the  task 
assignment  task(s).  Reconfiguration  involves  recognizing  the
request,  confirming  the  validity  of  the  request,  and 
carrying  out  the  request.  This  last  step  could  be 
accomplished  with  a  task  assignment  cycle  running  concurrent 
with  the  tasks  still  executing  in  the  system.  Of  course, 
these  system  tasks  would  be  of  higher  priority  than 
applications  tasks.  Note  that  we  may  allow  tasks  currently 
in  the  system  to  continue  to  execute  during  the 
reconfiguration  process.  The  tasks  in  the  system  at  the  time 
of  the  reconfiguration,  however,  are  marked  such  that  they 
are  not  reloaded  into  the  system  queues  upon  their  release 
of  the  processor  (except  for  polling  or  time  slice).  In  this
way,  new  tasks  assigned  can  be  loaded  into  the  queues  while 
the  "old"  set  is  finishing  to  allow  a  smooth  and 
uninterrupted  transition. 


As  discussed  in  the  previous  chapters,  the  hardware 
used  in  this  initial  development  stage  consists  primarily  of 


off-the-shelf  purchased  items.  Some  modifications  and 
additions  were  made.  These  enhancements  will  be  discussed  in 
this  chapter.  Similarly,  the  operating  system  and  hardware 
test  software  development  began  from  off-the-shelf  software. 
A  debugger  monitor,  written  by  the  author  for  a  previous 
project,  was  used  for  initial  hardware  checkout  and  provided 
the  basis  for  early  kernel  testing.  This  software,  as  well 
as  the  development  of  the  kernel  itself,  will  be  discussed  in
this  chapter.  In  addition,  the  applications  tasks,  I/O 
software,  and  simulation  set-up  will  be  addressed. 

(6.1)  Processor  Board 

The  four  processor  boards  purchased  required  only  minor 
modifications.  As  discussed  previously,  each  board's  dual 
port  RAM  can  be  addressed  uniquely  by  requests  over  the  VME 
bus.  This  required  modifications  to  be  made  to  the  dual  port 
address  decoder  implemented  in  a  Programmable  Logic  Array 
(PLA)  chip.  Once  this  change  was  made,  the  shared  memory 
segments  of  the  four  boards  were  addressed  beginning  at 
80000H,  100000H,  180000H,  and  E80000H.

To  check  out  the  boards,  the  above  mentioned  debugger 
monitor  was  modified  to  match  the  new  address  map  and 
utilize  the  MC68681  DUART  chip  for  character  I/O.  This 
monitor  software  provides  some  basic  commands  such  as  dump 
memory,  examine  and  modify  memory,  fill  memory,  display
registers,  set  breakpoints,  and  begin  user  program 
execution.  A  RAM  test  command,  however,  was  most  important. 
This  command  allowed  the  new  decoding  PLA's  to  be  tested, 
and  provided  a  means  to  test  remote  VME  bus  accesses.  Of 
particular  interest  was  a  test  of  bus  loading.  This  test  was 
performed  with  simultaneous  RAM  tests  being  executed  across 
the  VME  bus  by  multiple  processors.  The  test  provided  a 
measure  of  VME  performance  in  a  multiprocessor  environment 
in  a  near  worst  case  loading  situation.  The  results  of  this 
test  are  given  in  Chapter  7.

To  determine  the  DUART's  capabilities  to  act  as  the 
generator  of  the  'tick'  function,  a  modification  was  made  to 
the  monitor  in  the  initial  stages  of  testing.  The  DUART  was 
programmed  to  provide  a  once-per-millisecond  interrupt  via  a 
countdown  timer.  The  interrupt  routine  increments  a  counter 
in  memory.  Two  commands  were  added  to  verify  the  operation 
of  the  routines.  One  gives  the  time  in  minutes  and  seconds 
since  monitor  start-up  and  the  other  gives  the  raw  counter 
value. 
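The tick counter and the two verification commands might behave as in the following Python sketch. The names are hypothetical and the interrupt is simulated by an ordinary function call. Only the once-per-millisecond rate is taken from the text.

```python
TICKS_PER_SECOND = 1000   # one DUART timer interrupt per millisecond

tick_count = 0            # counter in memory, zero at monitor start-up


def tick_interrupt():
    """Stand-in for the interrupt routine: increment the counter."""
    global tick_count
    tick_count += 1


def show_counter():
    """Raw counter value."""
    return tick_count


def show_time():
    """Elapsed time since start-up as (minutes, seconds)."""
    return divmod(tick_count // TICKS_PER_SECOND, 60)
```

After 125,000 simulated interrupts the raw counter reads 125000 and the time command reports 2 minutes and 5 seconds.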

(6.2)  Kernel  Development 

The  decision  was  made  at  the  start  of  the  software 
development  process  to  incrementally  design  and  test  the 
functions  and  data  structures  of  the  kernel.  First,  the 
basic  ready  queue  and  queue  access  routines  were  developed. 
Next,  a  set  of  test  tasks  and  the  requisite  task 
communication  macros  were  added.  This  collection  provided 
the  basis  for  the  first  multitasking  tests.  When  this  stage 
was  verified,  the  timer  queue  and  queue  access  routines  were
added  and  tested.  This  iterative  process  was  followed  until
all  the  features  of  the  kernel  were  incorporated  and  tested. 
After  the  verification  of  the  kernel,  the  applications  tasks 
were  designed  and  implemented. 

The  design  of  the  kernel  functions  and  data  structures 
was  fairly  straightforward.  Typically,  however,  the  testing 
of  such  features  can  be  a  difficult  and  lengthy  process.  The 
decision  was  made  to  develop  test  tasks  which  could  visually 
demonstrate  the  correct  operation  of  the  kernel.  Concurrent 
testing  of  tasks  and  the  kernel  was  not  desired.  As  such, 
the  debugger  monitor  discussed  above  was  used  as  the  basis 
of  test  tasks  for  the  kernel.  The  reasons  are  simple.  The 
individual  commands  of  the  monitor  were  previously  verified 
and,  with  little  modification,  were  convertible  to  tasks.
Each  of  the  commands  chosen  is  user  interactive  or  provides
visual  evidence  of  operation. 

In  the  initial  task  set,  five  commands  were 
implemented:  dump  memory,  examine  memory,  fill  memory,  show 
time,  and  show  counter.  To  complete  the  set,  the  mainline 
command  interpreter,  character  input,  and  character  output 
routines  were  also  converted  to  tasks.  Character  I/O  was 
handled  by  utilizing  character  input  and  output  queues 
protected  by  special  kernel  functions.  Proper  operation  of 
the  ready  queue  and  associated  routines  was  demonstrated  by 
utilizing  two  processors;  one  with  the  command  line 
interpreter  and  I/O  tasks,  and  the  other  with  the  monitor 
command  tasks  and  I/O  tasks.  Multiple  monitor  commands,  such 
as  dump  and  fill,  were  executed  simultaneously,  with  results 
visually  demonstrated. 


To  test  the  timer  queue  operations,  command  tasks  were 
set  up  to  operate  at  specified  periodic  rates.  For  example, 
the  examine  task  was  set  up  to  execute  once  every  thirty 
seconds.  Obviously,  this  could  be  checked  by  forcing  the 
initiation  of  a  signal  to  the  examine  task  data  structure 
(via  p.poll)  with  the  command  line  interpreter  task.  The 
examine  task,  although  signalled,  did  not  respond  until 
released  from  the  timer  queue  at  the  end  of  its  thirty 
second  wait.  Again,  test  results  were  visual. 

Similarly,  communications  timeouts  and  time  slicing 
were  verified  using  these  tasks.  Although  not  all  kernel 
problems  were  detected  using  these  monitor  tasks,  most  of 
the  complicated  "bugs"  in  the  data  structure  routines  were 
eliminated. 

(6.3)  Floating  Point  Hardware 

Before  the  applications  software  could  be  designed  and 
tested,  the  addition  of  the  floating  point  coprocessor, 
MC68881,  was  required.  This  coprocessor  was  not  included  on
the  purchased  boards.  The  side  connector  on  the  processor
board,  however,  allowed  for  the  addition  of  a  wirewrap 
extension  board.  A  schematic  of  the  added  hardware  circuits 
is  given  in  Appendix  (A4). 

The  MC68881  was  developed  primarily  to  be  used  with  the 
MC68020  processor.  The  coprocessor,  however,  can  be  used  as 
a  peripheral  device  for  other  microprocessors  [26].  A  simple 
sequence  of  software  instructions  can  be  used  to  coordinate 
transfers  of  commands  and  operands  between  the  MC68000  and 
MC68881  [27].

To  test  the  developed  wirewrap  circuits,  the 


nonmultitasking  version  of  the  debugger  monitor  was  again 
used.  Additional  commands  were  added  to  perform  transfers  to 
and  from  the  coprocessor  registers,  floating  point  additions 
and  multiplications,  and  integer  multiplications  and 
additions.  A  single  coprocessor  register  was  used  as  an 
accumulator.  Once  these  single  operation  routines  verified 
the  operation  of  the  coprocessor,  the  applications  software 
was  designed  and  tested. 

The  MC68881  handles  three  types  of  floating  point  data
formats.  Single  precision  is  32  bits  wide  (1  sign  bit/8  bit
exponent/23  bit  mantissa),  double  precision  is  64  bits  wide
(1/11/52),  and  extended  precision  is  80  bits  wide  (1/15/64). 
In  all  internal  operations  and  registers  of  the  coprocessor, 
extended  precision  format  is  used.  For  the  purposes  of  the 
applications  tasks  to  be  discussed  below,  single  precision 
was  determined  to  be  sufficient.  As  such,  all  floating  point 
numbers  stored  in  main  memory  or  in  MC68000  registers  are 
kept  as  32  bit  single  precision  values.  Format  conversions 
are  performed  automatically  by  the  coprocessor  during 
operand  transfers. 
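The single precision layout and the loss of precision on storage can be examined with a short Python sketch. It uses the standard struct module rather than the coprocessor, so it only illustrates the 1/8/23 bit format. The function names are hypothetical, and big-endian byte order is chosen to match the MC68000.

```python
import struct


def single_precision_fields(value):
    """Split a number into the sign, exponent, and mantissa fields of a 32-bit float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                  # 1 sign bit
    exponent = (bits >> 23) & 0xFF     # 8-bit biased exponent
    mantissa = bits & 0x7FFFFF         # 23-bit fraction
    return sign, exponent, mantissa


def round_to_single(value):
    """Round a wider value to single precision, as happens when a result is stored."""
    return struct.unpack(">f", struct.pack(">f", value))[0]
```

For instance, 1.0 has the fields (0, 127, 0), while a value such as 0.1 changes slightly when rounded to single precision because it has no exact binary representation.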

(6.4)  Applications  Software  Development 

The  application  chosen  to  demonstrate  the  prototype 
multiprocessor  system  was  a  control  mixer  for 
reconfiguration  of  control  laws  of  the  URV  in  the  event  of 
control  surface  failure.  In  short,  this  concept  modifies  the 
control  surface  gain  matrix  to  offset  the  effect  of  a  failed 
surface  by  distributing  control  authority  to  the  other 
control  surfaces  of  the  aircraft.  In  effect,  the  remaining 


surfaces  act  to  compensate  for  the  loss.  Much  control  and 
mathematical  theory  has  gone  into  this  research;  however, 
the  theory  is  outside  the  scope  of  this  project.  Interested 
readers  are  directed  to  the  references  for  more  detailed 
information.

As  defined  in  [28],  the  linearized  continuous  aircraft
state  equations  of  the  unimpaired  aircraft  are  given  by

$$\dot{x}(t) = A\,x(t) + B_o\,d(t)$$

where  x(t)  is  the  aircraft  state  vector,  d(t)  is  the
aircraft  control  surface  deflection  vector,  and  Bo  is  the
control  effectiveness  matrix.  Also,

$$d(t) = K_o\,u(t)$$

where  u(t)  is  a  pilot  plus  flight  control  system  (FCS)  input
vector  and  Ko  is  the  control  mixer  gain  matrix.  Expanding
the  above  equation,  we  get

$$\dot{x}(t) = A\,x(t) + B_o K_o\,u(t).$$

As  will  be  discussed  below,  a  failure  changes  the  makeup  of
Bo.  In  order  to  keep  the  aircraft  model  the  same  (i.e.,
tolerate  the  failure),  the  quantity  (Bo  Ko)  must  remain
constant.  As  a  result,  the  Ko  matrix  must  be  adjusted  to
offset  the  changes  in  Bo.  If  we  use  o  subscripts  to  signify
matrices  of  the  unimpaired  aircraft,  and  i  subscripts  to
signify  matrices  of  the  impaired  aircraft,  we  get

$$B_i K_i = B_o K_o.$$

To  solve  for  Ki,

$$K_i = B_i^{-1} B_o K_o.$$

However,  Bi  may  not  be  a  square  matrix,  in  which  case  it  is
not  invertible.  As  such,  we  use  the  fact  that  a  matrix
multiplied  by  its  transpose  results  in  a  square  matrix.  As
shown  in  [28],

$$K_i = (B_i' B_i)^{-1} B_i' B_o K_o \qquad (6.1)$$

The  derivation  of  this  equation  assumes  that  Bi  is  an  m  X  n 
matrix,  where  m  (number  of  state  equations)  is  greater  than 
n  (number  of  control  surfaces).

For  the  purposes  of  the  applications  functions  of  the
prototype  test,  Bo  is  a  5  X  5  matrix,  where  each  column
corresponds  to  a  control  surface  on  the  URV.  It  is  assumed
that  a  control  surface  fails  in  the  center  and  locked
position.  As  such,  all  entries  in  the  column  corresponding
to  the  failed  surface  go  to  zero.  This  resultant  matrix  is
Bi.  If  Bi  were  left  in  this  form,  we  would  be  unable  to
compute  Ki  since  the  quantity  Bi'  Bi  is  singular  and,
therefore,  is  not  invertible.  As  a  result,  we  convert  Bi
such  that  Bi'  Bi  becomes  potentially  nonsingular  by  removing
the  zero  column.  This  conversion  leaves  Bi  as  a  5  X  4
matrix.  With  Ko  sized  at  5  X  3,  Ki  computes  into  a  4  X  3
matrix.  Note  that  each  row  of  the  Ki  matrix  corresponds  to  a
control  surface  of  the  aircraft.  As  such,  a  row  of  zeros  can
be  reinserted  into  the  row  corresponding  to  the  lost
surface.  The  resultant  gain  matrix,  Ki,  is  a  5  X  3  matrix
with  the  gains  of  the  good  control  surfaces  adjusted  to
compensate  for  the  loss  of  the  failed  surface.  This
derivation  can  be  extended  to  multiple  failed  surfaces.
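The column removal and row reinsertion can be traced in a short sketch. The Python below is an illustration under stated assumptions, not the prototype's 68000 code. The function names are hypothetical, exact fractions replace single precision arithmetic so that results can be checked by hand, and a Gauss-Jordan inverse is substituted for the inversion method used in the prototype.

```python
from fractions import Fraction


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverse(m):
    """Gauss-Jordan inverse with exact fractions (fails on a singular matrix)."""
    n = len(m)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def impaired_gain(bo, ko, failed):
    """Compute Ki by equation (6.1) for one failed surface (column `failed` of Bo)."""
    # Bi is Bo with the failed column zeroed; drop that column outright.
    bi = [[v for j, v in enumerate(row) if j != failed] for row in bo]
    bit = transpose(bi)
    ki = matmul(matmul(inverse(matmul(bit, bi)), bit), matmul(bo, ko))
    # Reinsert a row of zeros for the failed surface.
    zero_row = [Fraction(0)] * len(ko[0])
    return ki[:failed] + [zero_row] + ki[failed:]
```

With a 5 X 5 Bo and a 5 X 3 Ko the function returns a 5 X 3 Ki whose failed row is zero. The code itself is independent of the matrix sizes, so small matrices can be used to check it by hand.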

For  the  demonstration  of  the  prototype,  the  job  to  be 
performed  was  the  real  time  computation  of  the  Ki  matrix  in 
response  to  failed  surfaces.  Previous  tests  on  the  TN17 
aircraft  were  run  using  precomputed  control  mixer  gain 
matrices.  The  matrices  were  derived  on  the  ground  and  placed 
into  the  ROM  memory.  The  8061-based  autopilot  would  simply 
select  the  correct  gain  matrix  given  the  failure  induced. 
Note  that  the  "failure"  is  really  a  control  panel  switch 
which  signals  to  the  autopilot  which  surface  to  "fail". 

Real  time  computation  of  the  control  mixer  gain  matrix 
is  too  intensive  a  job  for  the  8061  to  handle  alone.  As
such,  the  demonstration  of  the  multiprocessor  prototype  has 
been  designed  to  compute  Ki  continuously  on  the  basis  of 
failure  information  supplied  by  the  8061.  For  the  purposes 
of  this  demonstration,  the  basic  autopilot  functions  have 
been  left  in  the  8061  processor.  The  Ki  function  in  equation 
(6.1)  is  broken  into  a  set  of  tasks,  many  of  which  can  be
run  in  parallel.  These  tasks  are  triggered  by  a  preceding 
task  which  samples  the  failure  data  supplied  by  the  8061. 
The  last  task  in  the  computation  of  Ki  triggers  a  following 
task  which  converts  the  computed  matrix  to  the  form  used  by 
the  8061  and  stores  it  in  dual  port  memory  where  the  8061 
can  access  it.  This  real  time  computation  is  transparent  to 
the  8061  in  that  the  8061  simply  uses  the  gain  matrix 
currently  stored  in  memory.  By  keeping  the  autopilot 


function  in  the  8061,  the  demonstration  of  the  prototype  is 
simplified.  The  modifications  to  the  existing  autopilot  are 
minor,  and  fewer  68000-based  tasks  are  required.  These 
assumptions  do  not  detract  from  the  intent  to  demonstrate 
the  capabilities  of  the  prototype. 

(6.5)  Test  Configuration 

Figure  6.1  gives  the  system  test  configuration  used  for 
the  demonstration  of  the  multiprocessor  prototype.  One 
aspect  of  the  set-up  requires  special  note.  The  8061 
processor  hardware  is  connected  to  one  of  the  MC68000 
processors  through  dual  ported  RAM  on  the  attached  wirewrap 
board.  This  set-up  is  not  the  ultimate  configuration  of  the 
TN21  avionics  system.  As  discussed  in  chapter  4,  the  8061  in 
the  system  will  be  interfaced  through  the  VME  bus.  To 
simplify  the  demonstration  hardware,  the  8061  section  was 
connected  to  one  of  the  MC68000's  to  eliminate  the  need  for
VME  interface  hardware  to  be  developed.  The  dual  port 
interface,  in  contrast,  is  simplistic  and  makes  for 
effective  demonstration  of  the  concepts  of  the  prototype. 
The  two  extra  tasks  described  above  were  assigned  to  the 
MC68000  interfaced  to  the  8061  to  make  this  deviation 
somewhat  transparent.  A  later  phase  of  the  development  of 
the  TN21  system  will  include  the  implementation  of  the 
proper  interface  and  software. 

(6.6)  Parallelization  of  the  Ki  Function 

To  divide  a  function,  such  as  equation  (6.1),  into 
tasks  for  a  multiprocessor,  one  must  first  identify  the 
areas  of  potential  parallelism  and  the  areas  of  strict 
serial  dependencies.  First,  let  us  examine  a  sequential
algorithm  for  the  computation  of  Ki. 


(1)  Take  the  transpose  of  Bi  (=  Bi') 

(2)  Multiply  Bi'  by  Bi  (=A) 

(3)  Take  the  inverse  of  A  (=C) 

(4)  Multiply  Bi'  by  Bo  Ko  (=D) 

(5)  Multiply  C  by  D  (=Ki) 

Note  that  Bo  Ko  is  a  static  entity,  and  as  such  can  be 
precomputed. 

First  analysis  of  the  above  algorithm  shows  only  one
area  of  parallelism.  Step  (4)  is  independent  of  steps  (2)
and  (3)  and  as  such  can  be  performed  concurrently.  Other
less  apparent  areas  of  parallelism  exist  if  the  steps  of  the
algorithm  are  defined  in  more  detail.  For  example,  step  (3)
involves  a  matrix  inversion  which  is  comprised  of  many
substeps.

A  basic  method  for  computing  a  matrix  inverse  is  given
by

$$A^{-1} = \operatorname{adj} A \,/\, \det A$$

where  det  A  is  the  determinant  of  A  and  adj  A  is  the
adjoint  of  matrix  A  and  is  defined  as

$$\operatorname{adj} A = (\operatorname{cof} A)'.$$

The  cofactor  of  A,  cof  A,  is  defined  as  the  matrix  whose  row
i  and  column  j  entry  is  given  by

$$\operatorname{cof} A(i,j) = \operatorname{minor}(A(i,j)) \cdot (-1)^{i+j}.$$

The  minor (  A(i,j)  )  is  the  determinant  of  the  submatrix 
obtained  by  eliminating  the  ith  row  and  jth  column  of  A. 
This  process  of  matrix  inversion  is  only  applicable  to 
square  matrices  that  are  nonsingular.  Given  that  Bi'  Bi  is 
always  a  square  matrix,  the  first  requirement  is  satisfied. 
The  second  requirement  that  the  matrix  need  be  nonsingular 
is  less  certain. 

Other  techniques  exist  to  compute  the  pseudoinverse  of 
matrices  in  cases  where  the  matrix  in  question  may  be 
singular.  Examples  include  the  CROUT  [29]  and  singular  value 
decomposition  [30]  techniques.  These  methods  are  certainly 
preferred  for  use  in  actual  flight  systems  performing 
inversions  on  state  matrices,  since  the  values  in  these 
matrices  are  uncertain  and  may  contain  many  very  small  or 
zero  valued  items.  In  fact,  the  assumption  that  the  column 
of  Bi  corresponding  to  the  failed  surface  goes  entirely  to 
zero  makes  Bi  singular  in  its  unmodified  form.  This 
assumption  forced  the  removal  of  that  column  of  Bi  in  order 
to  use  the  above  basic  inversion  method.  The  reason  why  the 
basic  technique  is  used  in  the  prototype  demonstration  is 
that  it  is  highly  computationally  intensive,  and  as  such,  is 
a  suitable  test  of  the  capabilities  of  the  prototype.  Also, 
the  basic  technique,  in  contrast  to  the  pseudoinverse 
techniques,  is  straightforward  in  implementation,  thereby 
simplifying  the  prototype  demonstration  effort. 

Condensing  the  basic  inversion  function,  we  get 
$$A^{-1} = (\operatorname{cof} A)' \,/\, \det A.$$

This  function  can  be  accomplished  by  the  following 
sequential  algorithm  replacing  step  (3)  from  above: 


(3a)  Compute  det  A
(3b)  For  each  A(i,j)
(3c)      Compute  E  =  det  (  minor  (  A(i,j)  )  )
(3d)      If  i+j  is  odd  then  E  =  -E
(3e)      E  =  E  /  det  A
(3f)      Store  E  at  C(j,i)
(3g)  Next  A(i,j)

Obviously,  steps  (3b)  to  (3g)  involve  independent 
computations  for  each  element  of  A.  Each  element  can, 
therefore,  be  handled  concurrently. 
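Steps (3a) to (3g) can be written out directly. The Python sketch below is an illustration rather than the prototype's code: the function names are hypothetical, the determinant is computed by first-row cofactor expansion (the text does not say how the prototype computes it), and the input is assumed to be at least 2 X 2 and nonsingular.

```python
def det(m):
    """Determinant by cofactor expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, a in enumerate(m[0]):
        sub = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(sub)
    return total


def minor(m, i, j):
    """Determinant of m with row i and column j removed."""
    return det([row[:j] + row[j + 1:] for k, row in enumerate(m) if k != i])


def adjoint_inverse(a):
    """Invert a square, nonsingular matrix (2 X 2 or larger) by the adjoint method."""
    n = len(a)
    det_a = det(a)                          # (3a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):                      # (3b) each (i, j) is independent
        for j in range(n):
            e = minor(a, i, j)              # (3c)
            if (i + j) % 2 == 1:            # (3d)
                e = -e
            c[j][i] = e / det_a             # (3e), (3f): stored transposed
    return c
```

No iteration of the double loop reads a value written by another, which is the property that lets each element be handled by a separate task.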

Just  as  the  matrix  inversion  step  can  be  broken
into  smaller,  potentially  parallel  substeps,  so  can  the 
matrix  multiplications  and  determinants.  These  functions 
also  involve  independent  threads  of  computation  which  can 
increase  parallelism.  However,  given  the  matrix  sizes 
discussed  above,  it  does  not  seem  beneficial  to  enforce 
parallelism  at  this  fine  a  level.  The  computations  in
parallel  should  be  intensive,  or  coarse  grained,  enough  to
justify  the  added  intertask  communications  and  kernel
overhead  costs.  Also,  the  number  of  processors  envisioned
(six  to  eight  maximum)  and  the  concentration  of  parallel
tasks  already  in  the  execution  regions  of  the  functions
lessen  the  impact  of  the  increased  parallelism.  As  a
result,  the  breakdown  of  the  Ki  algorithm  has  been  performed 
to  a  sufficient  level  to  define  task  units.  Figure  6.2 
graphically  shows  the  algorithm  with  its  sequential  and 
parallel  sections. 

Various  methods  exist  to  convert  the  above  algorithm  to 
tasks  and  communications.  Figures  6.3  and  6.4  show  two  such 
methods.  In  Figure  6.3,  the  tasks  operate  in  a  dataflow-like 
manner.  A  task  executes  only  after  receiving  inputs  or  a 
signal  to  proceed.  Each  task  is  of  the  form: 

(1)  Receive  inputs 

(2)  Compute  results 

(3)  Send  outputs. 

The  implementation  in  Figure  6.4  is  similar,  but  takes 
the  appearance  of  a  main  program  and  subroutines.  The  KI 
task  controls  the  ordering  of  calls  to  the  four  tasks  it 
communicates  with.  In  turn,  the  tasks  that  the  C  task 
communicates  with  are  like  subprograms  local  to  the  C  task. 
Data  movement  is  bidirectional,  unlike  in  Figure  6.3  where 
it  is  unidirectional.  The  numbering  in  Figure  6.4  represents 
the  sequencing  order  of  the  "calls".  "Calls"  of  equal 
sequence  number  are  concurrent.  Task  communications  of  this 
type  are  similar  to  those  termed  remote  procedure  calls  in 
other  literature  [31]. 

The  remote-procedure-call-like  format  has  been  utilized 
in  the  TN21  prototype  system  due  to  the  similarities  to 
conventional programming. The following sequence of macro
instructions implements the KI task "mainline" routine from
Figure 6.4:


    p.poll  transstart,KIdata
    c.poll  transend,d0-d1,KIdata
    p.poll  Cstart,KIdata
    p.poll  m453start,KIdata
    c.poll  m453end,d0-d1,KIdata
    c.poll  Cend,d0-d1,KIdata
    p.poll  m443start,KIdata
    c.poll  m443end,d0-d1,KIdata


A  "procedure  call"  is  implemented  by  a  corresponding 
p.poll/c.poll  pair  of  macro  instructions.  Parallel 
"procedure  calls"  are  made  by  multiple  p.poll  instructions 
in sequence. Recall that only c.poll instructions wait for
data exchanges. As a result, the initiation of "calls" to C
and m453 above allows parallel operation of remote procedure
call tasks.
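
The ordering of the KI "mainline" can be mimicked, as a loose analogy only, with Python futures: submitting work plays the role of p.poll (start the call, do not wait) and collecting a result plays the role of c.poll (wait for the data). This sketch is an assumption-laden illustration, not the RTMOS mechanism, which exchanges data through shared memory with polling. The names ki_mainline, c_task, and bo_ko (the product Bo Ko, passed in ready-made) are hypothetical, and the matrix inverse is supplied by the caller.

```python
from concurrent.futures import ThreadPoolExecutor


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def c_task(bt, bi, inverse):
    """C task: forms A = Bi' Bi, then inverts it (its own 'subprograms')."""
    return inverse(matmul(bt, bi))


def ki_mainline(pool, bi, bo_ko, inverse):
    trans_call = pool.submit(transpose, bi)        # p.poll transstart
    bt = trans_call.result()                       # c.poll transend
    c_call = pool.submit(c_task, bt, bi, inverse)  # p.poll Cstart
    d_call = pool.submit(matmul, bt, bo_ko)        # p.poll m453start
    d = d_call.result()                            # c.poll m453end
    c = c_call.result()                            # c.poll Cend
    ki_call = pool.submit(matmul, c, d)            # p.poll m443start
    return ki_call.result()                        # c.poll m443end
```

Because both submissions for C and m453 are issued before either result is collected, the two "calls" can proceed at the same time, which is the point made above about consecutive p.poll instructions.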

Some  parts  of  the  algorithm  have  strict  serial 
dependencies. For example, the transpose of Bi has to be
performed  before  all  other  parts  of  the  algorithm.  As  Figure 
6.2  graphically  demonstrates,  the  fastest  the  algorithm  can 
be  completed  is  roughly  the  sum  total  of  the  times  to  do  a 
transpose,  two  matrix  multiplications,  and  a  determinant. 
However,  this  alternative  is  much  better  than  the  sequential 
cost  of  a  transpose,  three  matrix  multiplications,  and 
seventeen  determinants  (sixteen  of  which  are  a  part  of 
cofactor derivations). Chapter 7 will discuss the impact of
parallel  processing  on  the  algorithm. 


(6.7)  Allocation  Of  Tasks  Onto  Multiple  Processors 

The following is the list of tasks used in the Ki
computation:

    KI            "mainline"
    trans         take transpose of Bi
    m453          compute D = Bi' Bo Ko
    C             initiate computation of C = (Bi' Bi)^(-1)
    m454          compute A = Bi' Bi
    Ainv          initiate determinants, compute A^(-1)
    det4          4X4 determinant of A
    det30-det315  3X3 determinants of minor( A(i,j) )
    m443          compute Ki = C D


Each  task  is  given  equal  priority  for  the  ready  queue.  The 
most  efficient  operation  can  be  accomplished  through  proper 
ordering  on  the  timer  queue,  with  release  times  assigned 
accordingly. 

The  assignment  of  tasks  to  a  single  processor  is 
somewhat  simple.  The  proper  ordering  on  the  timer  queue  is 
the  same  as  the  order  of  execution  in  the  serial  algorithm. 
As  such,  the  above  ordering  works  best  for  the  single 
processor  case.  Timer  queue  release  times  have  little  effect 
on  the  overall  performance  since  trans,  m453,  m454,  m443, 
and  the  determinant  tasks  will  all  execute  until  completion 
once  initiated.  As  such,  little  polling  will  occur. 

In  the  case  of  multiple  processors,  the  situation  is 
not  as  simple.  Care  must  be  taken  to  evenly  distribute  the 
processing  load  over  the  processors,  taking  execution  time 
into  account.  The  important  idea  is  to  keep  potentially 
parallel  tasks,  such  as  m454  and  m453,  operating 
concurrently.  The  placement  of  the  c.poll  and  p.poll 
instructions, as demonstrated above for the KI task,
defines  to  a  large  extent  this  parallelism.  Two  other  basic 
rules-of-thumb  are  as  follows.  Obviously,  the  tasks  that  can 
execute in parallel should be separated onto different
processors to allow for true concurrent operation. To ensure
that  the  separated  tasks  can  operate  efficiently  in 
parallel,  a  second  rule-of-thumb  involves  the  proper  use  of 
timer  queue  release  times  to  limit  delays  in  initiation  of 
tasks  due  to  polling.  For  example,  the  determinant  tasks 
(seventeen  in  number)  cannot  possibly  start  until  the 
transpose  and  first  two  matrix  multiplications  are  finished. 
If  the  determinant  tasks'  PCB's  are  loaded  into  the  ready 
queue  at  the  same  time  as  the  other  tasks',  the  mass  of 
tasks  involved  in  polling  could  cause  delays  in  tasks 
getting  started,  thereby  degrading  the  effect  of 
parallelism. 

Figures  6.5  and  6.6  give  the  timer  queue  orderings  for 
the  processors  in  the  two  and  three  processor  configurations 
respectively.  From  these,  one  can  see  that  the  main  effect 
of  parallelism  as  more  processors  are  added  is  in  the 
determinant tasks. Obviously, the best performance to be
expected  would  involve  the  use  of  seventeen  processors; 
however,  the  small  performance  gain  would  not  justify  the 
hardware  costs.  The  next  chapter  will  address  performance 
measures  determined  for  this  algorithm. 

(6.8)  8061  Hardware  Design 

No  new  schematics  were  required  for  the  8061  circuit 
used.  The  existing  autopilot  design  provided  all  the 
required  memory  and  I/O  interfaces.  The  only  modification 
needed  was  a  change  in  the  type  of  RAM  memory  chips  used. 
Given  that  the  68000-to-8061  interface  dual  port  RAM  was 
already specified for use, these chips were substituted in
the schematic design for the static RAM chips previously
there. The 50-pin connector used for I/O interface was left
to allow direct connection to a simulation-interface box
used in previous URV autopilot hardware-in-the-loop
simulations. This configuration allows the prototype to be
"plugged" into the simulation facilities exactly as in
previous tests. The resultant circuit occupies the attached
wirewrap  board  space  as  shown  in  Figure  6.7. 

(6.9)  8061  Software  Modifications 

As  was  mentioned  previously,  the  existing  autopilot 
functions  were  left  in  the  8061  processor  to  simplify 
software  changes.  Some  changes  were  required,  however,  to 
link  the  68000-multiprocessor-based  control  mixer  software 
with  the  basic  autopilot.  The  previous  8061  control  mixer 
software,  consisting  of  a  matrix  selection  algorithm  based 
upon  failure  number,  was  removed  and  replaced  with  a  routine 
to  write  the  failure  number  into  the  dual  port  memory  and 
assign  the  dual  port  gain  matrix  address  for  all  failure 
cases.  As  stated  before,  the  68000  multiprocessor  will  use 
the  failure  information  to  compute  and  store  the  appropriate 
gain  matrix  for  the  8061  autopilot  to  use. 

(6.10)  Hardware-in-the-Loop  Simulation  Configuration 

The  configuration  of  the  simulation  tests  is  given  in 
Figure  6.8.  The  prototype  hardware  is  connected  to  a  device 
which  contains  servos  corresponding  to  each  of  the  control 
surfaces  on  the  URV.  The  prototype  hardware  commands  these 
servos directly. The simulation computer detects the servo
movement as input to the airframe simulation. The graphic
display station also provides inputs to the simulation
computer in the form of pilot commands. Rudder, elevator,
aileron,  and  throttle  command  input  devices  are  located  at 
this  station.  The  simulation  computer  returns  sensor  data  to 
the  prototype  hardware  through  the  servo  interface  box  and 
aircraft  state  data  to  the  graphic  display  station. 

Surface  failure  is  accomplished  via  switches  on  the 
servo  interface  box.  A  failure  switch  exists  for  each  of  the 
URV  control  surfaces.  The  switch  settings  are  converted  by 
the  servo  interface  box  to  inputs  to  the  8061  circuit  in  the 
prototype.  Visual  confirmation  of  the  failure  can  be  made 
during  a  test  run  by  monitoring  the  servo  corresponding  to 
the  failed  surface.  Once  failed,  the  appropriate  servo 
ceases  to  move. 

With  this  configuration,  the  URV  multiprocessor 
prototype  can  be  tested  in  a  real  time  environment.  The 
graphic  display  provides  simulated  artificial  horizon, 
attitude,  speed,  climb  rate,  angle  of  attack,  and  side  slip 
angle  indicator  devices.  With  these,  the  "test  pilot"  can 
verify  the  operation  of  the  hardware  under  test  with  an 
environment  similar  to  an  actual  URV  flight.  This  is 
important,  especially  when  evaluating  performance  of  the 
control  mixer  under  failure  conditions. 


(7)  Prototype  Development  and  Demonstration  Results  and 
Findings 

The  purpose  for  producing  and  demonstrating  a 
prototype,  such  as  the  one  in  this  research  effort,  is  to 
prove the anticipated benefits, identify problem areas,
establish  areas  of  further  research,  and  provide  data  for 
use  in  later  system  development  stages.  The  development  of 
the  TN21  prototype  multiprocessor  system  realized  these 
goals.  The  anticipated  benefit  of  high  throughput  capability 
was  demonstrated  through  the  significant  decrease  in  time  to 
compute  a  complex  arithmetic  function  with  only  a  few 
processors.  Several  problem  areas  have  been  discovered  and 
corrected,  including  some  in  the  applications  functions 
area.  Data  has  been  collected  on  hardware  and  software 
performance.  These  aspects  will  be  discussed  in  this 
chapter.  Those  areas  requiring  further  research  will  be 
identified  in  Chapter  8. 

(7.1)  VME  Performance  and  Bus  Loading 

Chapter  4  addressed  the  concerns  anticipated  in  the  use 
of  a  single  bus.  The  claim  made  was  that,  with  sufficient 
bus  bandwidth  and  limited  numbers  of  processors  connected, 
bus  loading  would  not  become  significant.  To  back  up  this 
claim,  analyses  were  made  for  the  VME  bus,  both  in  the 
theoretical and physical cases.

(7.1.1)  VME  Access  Options

A discussion of VME access options [32] is first
required.  The  system  bus  controller  on  a  VME  bus  can 
implement  a  priority-based  or  round-robin-based  arbitration 
scheme.  The  VME  bus  has  four  priority  levels.  Each  potential 
bus master is assigned to one level. The assignment of the bus
in  the  priority  mode  of  operation  is  based  upon  those 
priority  levels.  A  processor  with  a  higher  priority  level  is 
assigned  the  bus  before  a  processor  with  a  lower  priority. 
The  round-robin  mode,  in  contrast,  offers  a  scheme  based  on 
fairness.  The  assignment  of  the  bus  is  rotated  between 
priority  levels.  As  such,  no  given  priority  level  can  be 
"starved"  from  bus  access.  The  VME  boards  purchased  provide 
for the selection of either option; however, in the analyses
to  follow,  round-robin  is  assumed. 

Once  a  processor  is  granted  the  bus,  it  may  either  hold 
the  bus  for  the  entire  time  required,  or  only  for  a  single 
bus  transfer,  depending  upon  the  release  option.  In  the 
release-when-done  (RWD)  option,  the  bus  master  has  control 
of  the  bus  until  it  has  completed  all  of  its  transfers.  In 
the  release-on-request  (ROR)  option,  another  bus  master  is 
assigned  after  the  current  transfer,  if  a  request  for  the 
bus  is  made.  Minimal  access  latency  is  achieved  with  the  ROR 
option.  In  the  purchased  boards  and  the  analyses  to  follow, 
this  option  is  used. 

Bus  arbitration  may  either  be  handled  during  the 
current  bus  cycle  or  after  the  bus  cycle.  Obviously,  maximum 
bus  bandwidth  is  achieved  with  concurrent  bus  usage  and 
arbitration.  This  option  is  used  on  the  hardware  and  in  the 
analyses.

(7.1.2)  VME  Bus  Access  Latency  Effects  (Theory)

If  n  processors  are  to  use  a  single  bus  and  if  true 
fairness  is  assumed,  a  processor  may  have  to  wait  up  to  n-1 
bus  transfer  cycles  to  get  access.  A  basic  MC68000  memory 
cycle  takes  four  clock  cycles  to  complete,  if  no  wait  states 
are  applied.  At  a  processor  clock  speed  of  10  MHz,  2.5 
million  transfers  per  second  can  be  made.  This  translates  to 
a memory access time of 400 nanoseconds (nsec).

The  VME  bus  can  be  viewed  as  an  extension  to  the 
MC68000 bus. To account for signal propagation over the VME
bus  and  VME  bus  arbitration,  let  us  assume  an  additional 
clock  cycle  per  transfer.  This  assumption  increases  the 
single  transfer  time  to  500  nsec.  If  six  processors  are 
connected  to  the  VME  bus,  the  maximum  bus  access  latency 
would be 2.5 microseconds (usec). This latency is on the
order  of  the  execution  time  of  all  MC68000  instructions  at  a 
processor  clock  speed  of  10  MHz.  If  the  ratio  of  non-bus 
accesses  to  bus  accesses  is  sufficiently  high,  the  added 
time  for  off-board  memory  cycles  is  not  too  significant. 
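
The arithmetic behind these figures can be restated as a short Python sketch. The constant and function names are assumptions introduced for illustration; the numbers (10 MHz clock, four clocks per local memory cycle, one extra clock assumed for the VME bus, six processors) are those given in the text.

```python
CLOCK_HZ = 10_000_000
CLOCK_NS = 1_000_000_000 // CLOCK_HZ   # 100 nsec per processor clock

LOCAL_CYCLE_CLOCKS = 4                 # basic MC68000 memory cycle, no wait states
VME_EXTRA_CLOCKS = 1                   # assumed propagation and arbitration cost


def local_transfer_ns():
    return LOCAL_CYCLE_CLOCKS * CLOCK_NS


def vme_transfer_ns():
    return (LOCAL_CYCLE_CLOCKS + VME_EXTRA_CLOCKS) * CLOCK_NS


def max_bus_latency_ns(processors):
    """Worst case under true fairness: wait for the other n-1 transfers."""
    return (processors - 1) * vme_transfer_ns()


print(local_transfer_ns())       # 400 nsec local memory access
print(vme_transfer_ns())         # 500 nsec per VME transfer
print(max_bus_latency_ns(6))     # 2500 nsec, i.e. 2.5 usec
```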

Consider,  however,  a  case  where  the  ratio  is  not  very 
large.  A  block  move  loop  is  a  near  worst  case  example. 

loop1   move.w  d0,(a0)
        cmpa.l  a0,a1
        bne     loop1

The above loop takes 2.4 usec at 10 MHz for local memory
accesses. If we use the latency time of 2.5 usec above and
assume the worst case of always suffering the maximum
latency, the loop will take approximately twice as long over
the VME bus as for local memory accesses. However, note that
the  latency  assumptions  were  based  upon  six  processors 
competing  for  the  bus.  Even  if  all  six  experienced  a 
simultaneous 100 percent increase in computation time using
near  worst  case  loops  such  as  the  one  above,  the  effective 
system  parallel  speed  up  is  a  factor  of  three.  Translating 
down  to  a  three  processor  contention  scenario,  we  would  see 
a  50  percent  increase  in  computation  time  and  a  speed  up 
factor  of  2.25. 

(7.1.3)  VME  Bus  Access  Latency  Effects  (Measured) 

To  further  illustrate  the  effects  of  bus  access  latency 
under  the  assumptions  of  the  research  program,  a  test  was 
run  on  the  project  hardware  to  measure  actual  contention 
effects.  The  RAM  test  function  included  in  the  debugger 
monitor  utilizes  tight  loops  similar  to  the  one  given  above. 
By  installing  the  monitor  code  on  each  of  multiple  processor 
boards  and  utilizing  the  RAM  test  function  on  each  to  access 
off-board memory, a scenario like the one above can be
tested.  Note  that  the  assumption  of  always  suffering  the 
worst  case  latency  will  not  apply  here. 

Utilizing  two  processors,  a  RAM  test  covering  32  Kbytes 
of  memory  took  55.6  seconds  under  contention  conditions.  The 
same  test  took  48.2  seconds  without  contention.  This 
translates  to  approximately  a  15  percent  increase  in 
computation  time.  Performing  the  same  test  with  three 
contending  processors  resulted  in  an  execution  time  of  59.8 
seconds (24 percent increase). Note that the measured
results are considerably less than the calculated case in
the previous section, testimony to the fact that the worst
case latency is not frequently realized. The three
processors can achieve a speed up factor of 2.4 in
contention traffic similar to the RAM tests. A six processor
configuration would certainly achieve a much better speed up
factor.
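
The percentages and the speed up factor quoted above follow directly from the measured RAM test times. A minimal Python sketch of the calculation is given here; the function names are assumptions for illustration, and the effective speed up is taken as the number of processors divided by the per-processor slowdown.

```python
BASELINE_SEC = 48.2                      # RAM test, no contention
CONTENDED_SEC = {2: 55.6, 3: 59.8}       # RAM test, n contending processors


def percent_increase(processors):
    return 100.0 * (CONTENDED_SEC[processors] / BASELINE_SEC - 1.0)


def effective_speed_up(processors):
    """n processors, each slowed by the measured contention factor."""
    return processors * BASELINE_SEC / CONTENDED_SEC[processors]


for n in (2, 3):
    print(n, round(percent_increase(n)), round(effective_speed_up(n), 1))
```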

(7.1.4)  VME  Bus  Access  Latency  Effects  (Conclusions) 

The  result  of  these  simplistic  analyses  is  that  the  VME 
bus  is  sufficient  for  the  TN21  multiprocessor  system,  given 
the  assumption  of  six  to  eight  processors  maximum.  Near 
worst  case  loops,  such  as  the  one  given  previously,  will  not 
typically appear in actual applications code simultaneously
on multiple processors. Block transfer loops to shared
memory do appear in the p.poll and c.poll routines. However,
non-VME access to VME access ratios of 200:1 or greater are
typical in the Ki computation routines (25 usec transfer
time to 5 milliseconds (msec) computation time). As a
result, contention is minimal and the concern of "bus
bottleneck" is alleviated. As mentioned before, this
analysis only applies to non-fine-grained computations. The
applications  on  the  TN21  multiprocessor  are  assumed  to  be 
coarse  grained.  Shared  memory  is  used  for  data  transfer  and 
storage.  Local  memory  is  used  for  intermediate  results, 
around  which  most  computation  time  is  spent. 

(7.2)  Execution  Times  for  Kernel  Routines 

The  performance  of  the  multiprocessor/multitasking 
kernel  is  a  key  element  in  the  overall  system  computation 
efficiency. Context switches were one such aspect of kernel
performance discussed in Chapter 5. Kernel performance
relates  directly  to  overhead,  which  is  time  not  spent  on 
applications  functions. 

(7.2.1)  RTOS  Comparisons 

Reference [16] gives some typical timings of RTOS's. In
general, context switch times of 75 to 150 usec can be
expected with these commercial products (processor clock
speeds unknown). The times solely represent the time to
switch tasks on the processor. Other timings, such as for
calls for clock acquisition or mailbox utilization, are not
given in the comparison table; however, typical values of
200 usec for service calls are mentioned in the text of the
reference. In all fairness, the computation of such
figures-of-merit is difficult, given the variable conditions
present (number of tasks in the system, interrupts, etc.).
Some attempt will be made to quantify these for the URV
RTMOS.

(7.2.2)  RTMOS  Timings 

The  following  is  a  list  of  execution  times  for  typical 
kernel service routines implemented as MC68000 TRAPs in the
RTMOS. 


    Exit              :  85.6 usec
    Release-on-poll   :  92.8 usec
    Sleep             :  91.4 + 9n usec
    Report-to-system  :  38.6 usec

where n = number of positions from the front of the
timer queue where the task is placed

Note that the first three times comprise the basic context
switch routines of the RTMOS. All are comparable to the
RTOS times referenced above.


Other kernel timings are more critical. For example,
the p.poll and c.poll macros may include release-on-poll
kernel calls, but also contain other areas of overhead. The
p.poll macro takes 34.4 + 2.8x + 3.4y usec, where x is the
number of attempts at the MC68000 test-and-set (TAS)
instruction used for data structure mutual exclusion and y
is the number of data words transferred. Typically, a p.poll
will take less than 50 usec; however, if larger amounts of
data (such as a matrix) are transferred, the time may
increase to around 100 usec.

The c.poll macro takes 37.2 + 2.8s + 100.4p + 2.8tp + 3.4q
+ 1.6r usec to complete, where s and t are counts of TAS
executions, p is the number of times through the poll wait
loop (with release-on-poll), q is the number of data words
transferred, and r is the number of 32 bit registers saved.
The minimum c.poll time, assuming no waiting on TAS
instructions or polling variables, is 45 usec. More
realistically, however, s=t=2 and p=1. If we assume q=2 and
r=8, the time for c.poll is 168.4 usec. As with p.poll, if
more data is transferred, the time will increase
accordingly.
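
The two timing expressions can be written as small Python functions to check the quoted figures. The function names are assumptions for this sketch. The 45 usec minimum is reproduced here with s=1, p=0, q=1, and r=1; that parameter choice is an inference from the formula, not something stated in the measurements.

```python
def p_poll_usec(x, y):
    """p.poll time: x = TAS attempts, y = data words transferred."""
    return 34.4 + 2.8 * x + 3.4 * y


def c_poll_usec(s, t, p, q, r):
    """c.poll time: s, t = TAS counts, p = passes through the poll wait
    loop, q = data words transferred, r = 32 bit registers saved."""
    return 37.2 + 2.8 * s + 100.4 * p + 2.8 * t * p + 3.4 * q + 1.6 * r


print(round(p_poll_usec(x=1, y=2), 1))                  # 44.0, under 50 usec
print(round(c_poll_usec(s=1, t=0, p=0, q=1, r=1), 1))   # 45.0, minimum case
print(round(c_poll_usec(s=2, t=2, p=1, q=2, r=8), 1))   # 168.4, typical case
```

The 100.4p term dominates as soon as a single pass through the poll wait loop occurs, which is why the typical c.poll figure is several times the minimum.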

The  tick  function  interrupt  service  routine  (ISR)  is 
another  critical  consideration  in  kernel  performance.  The 
following  is  a  list  of  functions  performed  within  tick  and 
the  times  resulting: 


    Set up and clock update    :  15.2 usec
    Get task from timer queue  :  3.6 + 41.6m usec
    Time slice path            :  117.6 usec
    No time slice path         :  20.4 usec
    Interrupt handling         :  4.4 usec

where m = number of PCB's released from the timer queue.

The total tick processing time is 140.8 + 41.6m usec if a
time slice is performed and 43.6 + 41.6m usec if not. Timer
queue staggering of tasks may be desirable to prevent long
tick ISR times which can significantly delay applications
tasks processing. One way or the other, the price in time
eventually has to be paid.

(7.2.3)  Implications  of  the  RTMOS  Timing  Data 

The  kernel  functions  discussed  above  were  not  optimized 
for  minimal  execution  time.  Optimization  is  for  final 
products,  such  as  the  RTOS  products  referenced,  not  for 
prototype  systems.  However,  in  all  likelihood,  very  little 
significant  improvement  would  be  expected.  At  any  rate,  the 
times noted are comparable to or better than those for
commercial  products. 

One  area  where  improvement  could  be  made  is  in  the 
timer  queue  storage  routines.  The  current  implementation 
searches  the  timer  queue  from  front  (least  time  to  release) 
to  rear  to  determine  where  the  task  should  be  placed.  In 
actual  use,  most  tasks  would  probably  be  placed  closer  to 
the  end  of  the  queue.  As  such,  minimal  search  time  would 
likely  be  achieved  by  beginning  at  the  rear  of  the  queue, 
moving  forward. 
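
The suggested change can be illustrated with a minimal Python sketch that counts how many queue positions are examined when the search starts at the front versus the rear. The timer queue is modeled here as a plain list of (release_time, task) pairs sorted by release time; this representation and the function names are assumptions for illustration and do not reflect the RTMOS data structures.

```python
def insert_from_front(queue, release_time, task):
    """Scan from the earliest release time toward the rear."""
    pos = 0
    steps = 0
    while pos < len(queue) and queue[pos][0] <= release_time:
        pos += 1
        steps += 1
    queue.insert(pos, (release_time, task))
    return steps


def insert_from_rear(queue, release_time, task):
    """Scan from the latest release time toward the front."""
    pos = len(queue)
    steps = 0
    while pos > 0 and queue[pos - 1][0] > release_time:
        pos -= 1
        steps += 1
    queue.insert(pos, (release_time, task))
    return steps


base = [(10, "a"), (20, "b"), (30, "c"), (40, "d")]
front_queue, rear_queue = list(base), list(base)
print(insert_from_front(front_queue, 45, "e"))   # 4 positions passed
print(insert_from_rear(rear_queue, 45, "e"))     # 0 positions passed
```

Both functions leave the queue in the same order; only the number of positions passed differs, and it is smaller from the rear whenever the new task belongs near the end of the queue.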

The  most  critical  implication  of  the  timing  data  given 
involves the tick routine. A minimum tick execution would
take 43.6 usec. Since a tick takes place once per
millisecond, an automatic minimum overhead of 4.4 percent is
realized. One time slice or a couple of releases from the
timer queue would increase this to around 140 usec, or 14
percent.  Although  the  means  to  improve  the  routine 
performance-wise  have  not  yet  been  investigated,  any 
optimization  work  on  the  kernel  should  be  concentrated  on 
the  tick  routine  first.  Interestingly  enough,  none  of  the 
RTOS's  referenced  give  data  on  this  timing  characteristic, 
although  all  must  have  similar  functions. 
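
The tick figures from the previous section and the overhead percentages above can be reproduced with the following Python sketch. The names are assumptions for illustration; the component times and the one millisecond tick period are those given in the text.

```python
TICK_PERIOD_USEC = 1000.0   # one tick per millisecond

SETUP_USEC = 15.2           # set up and clock update
QUEUE_BASE_USEC = 3.6       # get task from timer queue, fixed part
PER_RELEASE_USEC = 41.6     # per PCB released from the timer queue
TIME_SLICE_USEC = 117.6     # time slice path
NO_SLICE_USEC = 20.4        # no time slice path
INTERRUPT_USEC = 4.4        # interrupt handling


def tick_usec(released, time_slice):
    path = TIME_SLICE_USEC if time_slice else NO_SLICE_USEC
    return (SETUP_USEC + QUEUE_BASE_USEC + PER_RELEASE_USEC * released
            + path + INTERRUPT_USEC)


def tick_overhead_percent(released, time_slice):
    return 100.0 * tick_usec(released, time_slice) / TICK_PERIOD_USEC


print(round(tick_usec(0, False), 1))               # 43.6 usec minimum
print(round(tick_overhead_percent(0, False), 1))   # 4.4 percent
print(round(tick_overhead_percent(0, True), 1))    # 14.1 percent with a time slice
```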

(7.3)  Applications  Computation  Times 

As  with  the  kernel  functions  above,  the  applications 
tasks  developed  for  the  prototype  were  not  optimized.  For 
example,  more  extensive  use  of  the  floating  point  registers 
within the MC68881 floating point coprocessor as
accumulators  would  have  saved  many  processor-coprocessor 
data  transfers.  Optimization  of  these  routines,  however, 
would  not  have  served  to  demonstrate  the  goals  of  this  phase 
of  development.  Of  more  importance  to  these  goals  is  the 
demonstration  of  significant  speed  up  and  minimal  overhead 
when  parallelism  is  applied  to  the  problem. 

(7.3.1)  Sequential  Limitations 

As  discussed  in  Chapter  6,  Figure  6.3  demonstrates  that 
the  fastest  time  to  compute  the  Ki  function  can  be 
approximated  by  the  sum  total  of  the  times  to  do  the 
transpose,  two  matrix  multiplications,  and  a  4X4 
determinant.  The  times  for  these  are  given  by 


    trans  :   0.82 msec
    m454   :  10.60 msec
    m443   :   6.78 msec
    det4   :   6.68 msec

The  sequential  limitations  of  the  algorithm,  therefore, 
bound  the  computation  time  to  no  better  than  about  25  msec, 
no  matter  how  much  parallelism  is  applied. 

(7.3.2)  Measured  Computation  Times 

Measurements  of  the  actual  times  to  compute  the  Ki 
algorithm  were  taken.  Before  the  algorithm  was  parallelized, 
it  was  developed  as  a  single  processor,  sequential  program. 
This  version  took  59  msec  to  complete.  The  parallelized 
version  required  shared  memory  data  exchanges  via  c.poll  and 
p.poll  instructions  acting  as  remote  procedure  calls.  Also, 
kernel  overhead  had  some  effect.  The  resulting  times 
measured  for  the  parallelized  version  using  RTMOS  are  as 
follows: 

1  processor  :  69  msec 

2  processors  :  47  msec 

3  processors  :  36  msec 

As  expected,  the  single  processor  version  using  the  RTMOS 
suffers some overhead penalties (17 percent total). As
parallelism  is  applied,  however,  the  execution  time  drops 
accordingly. 
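
For reference, the overhead and speed up ratios implied by these measurements can be computed as in the Python sketch below. The names are assumptions for illustration; only the measured times come from the text, and the derived ratios are simple quotients rather than additional measurements.

```python
SEQUENTIAL_MSEC = 59.0                    # single processor, no RTMOS
RTMOS_MSEC = {1: 69.0, 2: 47.0, 3: 36.0}  # parallelized version under RTMOS


def rtmos_overhead_percent():
    return 100.0 * (RTMOS_MSEC[1] - SEQUENTIAL_MSEC) / SEQUENTIAL_MSEC


def speed_up_vs_rtmos_single(processors):
    return RTMOS_MSEC[1] / RTMOS_MSEC[processors]


def speed_up_vs_sequential(processors):
    return SEQUENTIAL_MSEC / RTMOS_MSEC[processors]


print(round(rtmos_overhead_percent()))          # 17 percent
for n in (2, 3):
    print(n, round(speed_up_vs_rtmos_single(n), 2),
          round(speed_up_vs_sequential(n), 2))
```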

As confirmation, Figure 7.1 demonstrates the division
of tasks onto three processors. The fastest computation time
for this division is the sum total of the transpose, two
matrix multiplications, one 4X4 determinant, and six 3X3
determinants. Each 3X3 determinant was measured to take
1.74 msec. The total, therefore, is 35.32 msec. This figure
confirms the total algorithm computation time for three
processors as given previously.
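
The two sums used in this chapter, the sequential bound of about 25 msec and the three processor estimate of 35.32 msec, can be checked with a few lines of Python. The dictionary and function names are assumptions for illustration; the task times are the measured values quoted in the text.

```python
TASK_MSEC = {
    "trans": 0.82,   # transpose
    "m454": 10.60,   # matrix multiplication
    "m443": 6.78,    # matrix multiplication
    "det4": 6.68,    # 4X4 determinant
    "det3": 1.74,    # one 3X3 determinant
}


def serial_path_msec(det3_in_series):
    """Time along the longest serial path, with the given number of
    3X3 determinants executed one after another on that path."""
    fixed = sum(TASK_MSEC[name] for name in ("trans", "m454", "m443", "det4"))
    return fixed + det3_in_series * TASK_MSEC["det3"]


print(round(serial_path_msec(0), 2))   # 24.88 msec, the sequential bound
print(round(serial_path_msec(6), 2))   # 35.32 msec, three processor division
```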

(7.3.3)  Implications  of  Execution  Time  Data 

One  extra  data  point  can  be  derived  from  the  three 
processor  timing  confirmation  in  the  previous  section.  Note 
that  the  difference  between  the  total  algorithm  time  and  the 
sum  total  of  the  sequential  parts  is  less  than  one 
millisecond.  This  translates  to  an  overhead  contribution  of 
around  3  percent  for  context  switches  and  tick  handling.  A 
previous  section  had  predicted  a  minimum  overhead 
contribution of 4.4 percent just for tick computations. These
figures are fairly close, and serve to confirm expectations.
They also indicate that context switching times are
negligible. This finding is significant in that it
demonstrates  that  polling  does  not  contribute  a  high  cost  to 
the  overall  results. 

The  17  percent  increase  in  execution  time  from  the 
sequential  to  single  processor  RTMOS  versions  can  be 
attributed  to  two  main  factors.  First,  the  overhead 
contribution  of  tick  varies  from  a  minimum  of  about  4 
percent  to  as  much  as  33  percent  during  a  time  period  when 
eight  tasks  are  released  from  the  timer  queue.  These  peaks 
are  rare,  but  still  contribute  to  the  overall  overhead 
figures. 

Secondly, the choice was made in the applications tasks
design stage to pass entire matrices through some p.poll and
c.poll exchanges. In the sequential version, all accesses
were local, so only addresses were passed between routines.
The process of passing matrices is time consuming and could,
in some cases, be replaced by passing matrix addresses. A
side effect of this, however, is that intermediate
computations would take place out of shared memory, rather
than local memory.

(7.4)  Timer  Queue  Utilization  Timing 

One  hypothesis  made  in  the  early  stages  of  the  URV 
RTMOS  concept  development  was  that  proper  timer  queue 
release  time  staggering  would  be  required  in  order  to  limit 
the  adverse  effects  of  polling  on  performance.  As 
demonstrated  in  the  previous  section,  RTMOS  polling  has  a 
minimal  effect  with  coarse  grained  parallel  computing.  This 
finding  would  seem  to  indicate  that  staggering  is  less 
critical  than  expected.  As  confirmation,  the  applications 
problem  was  run  with  RTMOS  on  a  single  processor  using  and 
not  using  timer  queue  staggering.  The  result  was  an 
identical  69  msec  execution  time  for  both  versions. 

Although  the  lessened  effect  of  polling  can  be 
attributed  somewhat  to  this  finding,  another  reason  was 
discovered.  After  one  iteration  through  the  ready  queue  and 
processor,  the  timer  queue  becomes  naturally  staggered, 
regardless  of  its  initial  state.  This  effect  is  caused  by 
the  execution  time  of  tasks  on  the  processor  and  the 
subsequent  delay  in  successor  tasks  beginning.  Tasks  are 
placed  back  on  the  timer  queue  as  they  complete;  and,  as 
such, have their next release time delayed accordingly. As a
result, only the first iteration is affected by the
non-staggered order. All other iterations become naturally
ordered  and  execute  at  optimal  speed. 


(7.5)  Results  of  the  Prototype  Hardware-in-the-Loop 
Simulation 

The prototype hardware, RTMOS, and control mixer
applications  tasks  were  tested  under  the  simulation 
conditions  described  in  Chapter  6.  The  simulation  responded 
to  the  failures  induced  much  as  expected.  The  failure 
responses  will  be  discussed  below.  In  the  development  and 
initial  tests  of  the  applications  tasks,  errors  in  the  Bo 
matrix  were  discovered.  The  problem  and  the  corrections  used 
will  also  be  discussed.  As  mentioned  in  Chapter  6,  the 
reader  is  directed  to  the  references  and  other  literature 
for  further  details  concerning  flight  control,  the  URV 
aircraft,  and  the  control  mixer. 


(7.5.1)  Simulation  Response  to  Failures 

The  control  mixer  model  used  was  a  five  control  surface 
model  utilizing  two  elevators,  two  ailerons,  and  a  rudder.  A 
later  model  [33]  utilizes  two  additional  surfaces  (flaps)  on 
the  URV  which  can  be  used  to  provide  better  response  to 
failures,  particularly  in  the  roll  axis.  However,  lack  of 
sufficient  information  on  this  later  model  during  the 
development  stages  prevented  its  use.  Still,  the  earlier 
model  provided  for  a  sufficient  computational  load  to  test 
the  multiprocessor. 

The  response  to  elevator  failures  was  the  best  of  all 
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cases.  When  one  of  the  elevators  was  failed,  the  mixer 
provided  double  the  authority  to  the  other  elevator  in  the 
pitch  axis,  and  utilized  the  ailerons  to  assist  in  pitching 
the  aircraft.  The  resultant  response  was  suitable  pitching 
control  with  a  slight  initial  roll  (corrected  by  the 
autopilot). The roll seemed to be induced by the aileron
movement  commanded  by  the  mixer  to  assist  the  single 
elevator  in  pitching.  Roll  and  yaw  motion  was  not  affected 
by  an  elevator  failure,  as  expected. 

Aileron  failure  response  was  not  as  good.  As  confirmed 
by  off-line  derivation  of  the  gain  matrix,  the  mixer 
response  to  an  aileron  failure  was  to  effectively  zero  out 
the  gain  to  the  other  aileron  and  the  rudder  in  the  roll 
axis  so  that  only  the  elevators  were  used  to  roll  the 
aircraft.  The  result  was  a  sluggish  roll  response  with 
significant  downward  pitching  motion.  This  response  was  to 
be  expected  since  the  elevators  have  far  more  force  in  the 
pitch  axis  than  in  the  roll  axis.  A  better  response  would  be 
to  give  more  authority  to  the  remaining  aileron,  such  as  was 
done  for  the  remaining  elevator  in  the  failed  elevator  case 
discussed  above.  This  observed  response  comes  directly  from 
the  control  mixer  algorithm  and  URV  model.  The  prototype 
calculations  were  confirmed  by  off-line  matrix  computations. 

As  discussed  in  [28],  the  rudder  failure  response,  as 
computed  by  the  control  mixer,  leads  to  unstable  aircraft 
control.  This  is  due  to  the  fact  that  the  elevators  and 
ailerons  do  not  have  sufficient  authority  in  the  yaw  axis  to 
compensate  for  a  rudder  failure.  As  a  result,  the  gains 
become excessive and the surfaces saturate. Again, this
response is a control mixer problem, not a prototype
calculations  problem. 

(7.5.2)  Bo  Value  Errors 

In  the  development  of  the  control  mixer  applications 
tasks,  errors  in  the  Bo  matrix  were  discovered. 
Documentation  on  later  URV  control  mixer  work  [33]  was 
located,  and  an  updated  Bo  was  used.  The  primary  difference 
between  the  original  Bo  matrix  and  the  later  version 
involved  surface  polarities.  Differences  existed  between  the 
surface  direction  assumptions  made  in  the  early  stages  of 
control mixer development and the actual aircraft setup.
These  differences  were  corrected  in  the  later  work.  With  the 
new  Bo  matrix  installed,  the  expected  responses,  noted 
above,  were  observed. 

(7.5.3)  Potential  Resolution  Problems 

One  final  observation  on  the  control  mixer  applications 
used,  and  the  8061  autopilot,  concerns  the  resolution  of 
numbers  calculated  for  the  gain  matrix.  The  numbers  derived 
for  the  Ki  gain  matrix  ranged  from  less  than  .001  to  around 
25.  However,  the  conversion  algorithm  used  to  change  the 
numbers  to  a  form  usable  by  the  8061  autopilot  only  allows 
for  four  bits  of  resolution  to  the  right  of  the  decimal 
point (fixed point format). As a result, the smallest gain
magnitude  greater  than  zero  possible  is  .0625.  Any  gain 
magnitude  smaller  than  this  will  be  converted  to  zero.  The 
conversion  truncation  is  the  reason  why  the  remaining 
aileron  and  rudder  gains,  in  the  failed  aileron  case,  are 
zero. Some authority is actually assigned in the floating
point calculations, but the gains are below the truncation
threshold, and are lost in the conversion process. This
area may need further attention if control mixer development
is to continue on the prototype.
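
The truncation effect described above can be illustrated with a
short sketch. The Python below is not the conversion routine used
on the prototype. It assumes a signed fixed point value with four
fractional bits and truncation toward zero, and the names to_fixed
and from_fixed are hypothetical. Word length and overflow handling
are not modeled.

```python
import math

FRACTION_BITS = 4               # bits to the right of the point (from the text)
SCALE = 1 << FRACTION_BITS      # 16 counts per unit, so one count is 1/16 = 0.0625


def to_fixed(gain: float) -> int:
    """Convert a floating point gain to a signed count.

    Assumption: the fractional remainder is truncated toward zero,
    so any magnitude below 0.0625 becomes a count of zero.
    """
    return math.trunc(gain * SCALE)


def from_fixed(count: int) -> float:
    """Recover the gain value that the fixed point count represents."""
    return count / SCALE


if __name__ == "__main__":
    # Sample magnitudes spanning the range quoted in the text (<.001 to ~25).
    for gain in (25.0, 0.0625, 0.05, 0.001, -0.03):
        count = to_fixed(gain)
        print(f"{gain:>8} -> count {count:>4} -> {from_fixed(count)}")
```

Under these assumptions a gain of 25 or .0625 survives the
conversion exactly, while gains of .05, .001, and -.03 all become
zero, which matches the loss of the small aileron and rudder gains
noted for the failed aileron case.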




(8)  Conclusions 


In order to make effective use of the new vehicle
airframe being designed and constructed, and to provide high
speed computing capabilities for embedded tests, AFWAL/FIGL
has made the decision to develop a new avionics/control
system incorporating a low cost multiprocessor architecture
and software operating system. The effort is being performed
in-house, utilizing years of multiprocessor system analysis,
design, development, programming, and test experience. The
first phase of this effort, described herein, has produced
and demonstrated a prototype of this system. Multiple
off-the-shelf MC68000 processor boards have been combined with a
VME backplane bus and wirewrapped 8061 I/O circuitry and
MC68881 coprocessors to form the hardware of the prototype.
A real time multiprocessor/multitasking operating system
(RTMOS) has been specified and developed to manage the
parallel software units (tasks) of the system.
Interprocessor communications protocols, and a simple
methodology to use them to develop coarse grained parallel
code, have also been developed.


The development of the URV multiprocessor
avionics/control system is not yet finished. The next phase
of development will bring the system closer to its
completion by refining and enhancing the prototype in
several areas. The 8061 I/O circuitry will be interfaced to
the VME bus as originally specified. Additional MC68000
boards will be utilized to provide even more computing
power. The RTMOS will be fine-tuned for more efficient
operation. Task assignment will be made dynamic,
distributing the task load to the number of processors
present. Research will be performed to specify and test a
better multiprocessor clock synchronization scheme. The
applications tasks will be enhanced to allow for a seven
surface control mixer model that will be responsive to
multiple surface failures.

In  the  longer  term,  the  system  hardware  will  need  to  be 
evaluated  and  modified  for  flight  operation.  An  extensive 
verification  procedure  will  also  be  required.  Development 
and  test  of  parallel  software  will  have  to  be  addressed  from 
an  applications  programmer  viewpoint.  High  order  languages 
will  need  to  be  applied  in  order  to  make  this  software 
manageable.  The  impacts  of  changing  airframe  configurations 
on  the  control  model  and  software  will  have  to  be  assessed. 
A  technique  to  allow  changes  to  occur  with  minimal  impact  on 
software  will  be  required. 

In  short,  much  work  remains  to  be  performed  before  the 
URV  multiprocessor  system  is  ready  for  actual 
implementation.  Still,  much  has  been  accomplished;  the 
foundation  has  been  laid.  Multiprocessor  technology  is 
beginning  to  see  application  in  many  areas.  The  low 
cost/risk  URV  research  testbed  is  one  area  where  significant 
payoffs  can  be  realized. 